Logo
Logo

Programming by Stealth

A blog and podcast series by Bart Busschots & Allison Sheridan.

Instalment 169 of X — Advanced YAML Topics

In the previous instalment, we explored some reasons why we might be interested in learning YAML, and its uses and philosophy before we shifted focus to the basic syntax. In addition to that basic syntax, there are two important additional concepts we didn’t cover — multi-line strings, and the more efficient optional syntactic sugar for simple nested elements. In fact, we ended the instalment on a teaser about a better way to represent simple arrays of arrays, etc. We’ll start this instalment with these two additional elements of the syntax.

Because we’ve just finished exploring the jq command and language, it seems apt to finish this instalment with a brief overview of the very much related yq command which uses the jq language to query data represented in a wide range of formats, including YAML. This means we can use the language we just learned for processing JSON files to process YAML, and, to translate between YAML, JSON, and many other formats!

Matching Podcast Episode

You can also Download the MP3

Read an unedited, auto-generated transcript with chapter marks: PBS_2024_07_06

Instalment Resources

Multi-Line Strings (String Blocks)

YAML provides many ways to render multi-line strings, and they ensure you can sanely indent your markup without having unintended consequences for the contents of your strings.

The full detail of every possible way in which you can type a multi-line string into a YAML file is simply too much for this quick overview, but if you want to dig much deeper I recommend this excellent Stack Overflow answer.

Flow Style Multi-Line Strings (Avoid!)

I’m going to start with the simplest of the syntaxes to read, and while you will definitely encounter it in lots of YAML written by others — I strongly advise against using it in your own YAML files!

If you just type a string over multiple lines without quoting it or using any kind of operator it will probably just work. This is what the specification calls the flow style. Superficially, these behave quite sensibly, but the devil is in the details, because they have some nasty pitfalls!

Before we go any further, here’s some YAML with example flow style multi-line strings:

---
Flow Style String 1: This is a string in the flow style
  within a mapping, so it uses indentation.
  
  This is a second paragraph within a flow style string.
Flow Style String 2:
  
  This is also a flow style string with two paragraphs
  
  But it has padding lines around it

Some Other Key: 42

This converts to the following JSON markup:

{
  "Flow Style String 1": "This is a string in the flow style within a mapping, so it uses indentation.\nThis is a second paragraph within a flow style string.",
  "Flow Style String 2": "This is also a flow style string with two paragraphs\nBut it has padding lines around it",
  "Some Other Key": 42
}

Notice that YAML’s default behaviour was indeed quite sensible:

  1. The padding was silently absorbed so it didn’t mess up either of the multi-line strings
  2. The single internal newline character between flow style and ` within a mapping` was replaced with a space.
  3. The double-newline characters separating the paragraphs were collapsed into a single newline character (you may or may not like that behaviour).
  4. The padding lines above and below the second multi-line string were ignored.

That does (mostly) seem very sensible, but, remember I mentioned there were pitfalls? I’ll defer to the previously recommended stack overflow answer for a summary:

… no escaping, no ` # or : ` combinations, first character can’t be ", ' or many other punctuation characters …

So, if you use these strings you need to be constantly on the alert for special characters which will introduce subtle bugs into your data. That’s a lot of mental load to place on yourself, and since we’re all human it makes it inevitable that we’ll corrupt our own data every now and then. So, my advice is to avoid the flow style for strings and use string blocks instead.

String Blocks (Always Use)

String blocks are extremely powerful, but without the proper context, they can be extremely confusing. To understand why the syntax is what it is, it helps to understand the problems they are intended to solve.

The obvious thing we need handled is indentation, but all the permutations of string blocks do that, so we can take that as a given.

So, what aspects of a string’s representation do we need to assert control over?

  1. How internal newline characters are handled
  2. How trailing white space is handled

In YAML’s world view, you control the handling of internal newline characters by choosing a block operator, and you control the handling of trailing white space with an optional chomp indicator.

You start your multi-line strings with one of the two string block operators, which is optionally followed by a chomp indicator, and then you start the content of your string on the next line.

So, let’s start by describing the two block operators:

String Block Operator Syntax Description
The literal operator | New line characters are preserved and the indentation at the start of every new line is ignored. Allows multi-line strings to be indented.
The fold operator > New line characters and the indentation at the start of each new line are collapsed into a single space. Allows a single line string to be written across multiple lines.

Now, let’s describe the optional chomp indicators:

Chomp Indicator Syntax Description
The strip indicator - All trailing white space (including newline characters) is removed.
The default clip behaviour   When no chomp indicator is added, YAML clips the trailing white space by replacing it all with a single trialing newline character.
The keep indicator + All trailing white space (including newline characters) is retained.

That sounds complicated, but let’s look at an example YAML document that defines an array with multiple copies of the same string, each using a different combination of block operator and chomp indicator:

---
# literal operator with default clip behaviour
- |
  Hello
  World!
  
# fold operator with default clip behaviour
- >
  Hello
  World!
  
# literal operator with strip chomp indicator
- |-
  Hello
  World!
  
# fold operator with strip chomp indicator
- >-
  Hello
  World!
  
# literal operator with keep chomp indicator
- |+
  Hello
  World!
  
# fold operator with keep chomp indicator
- >+
  Hello
  World!
  

Converting this to JSON we get the following array:

[
  "Hello\nWorld!\n",
  "Hello World!\n",
  "Hello\nWorld!",
  "Hello World!",
  "Hello\nWorld!\n\n",
  "Hello World!\n\n"
]

Let’s map those strings to the operators that produced them to better understand what happened:

Operator Generated String Explanation
| "Hello\nWorld!\n" The default literal behaviour is to ignore the indentation and keep just one trailing newline character.
> "Hello World!\n" The default fold behaviour is to collapse the newlines and the indentation into a single space, and to keep just one trailing newline character.
|- "Hello\nWorld!" The literal operator with strip indicator ignores the indentation and all the trailing white space.
>- "Hello World!" The fold operator with the strip indicator collapses the newlines and the indentation into a single space, and, strips all the trailing white space.
|+ "Hello\nWorld!\n\n" The literal operator with the keep indicator ignores only the indentation, it keeps all the newline characters, including the empty line at the end of the block.
>+ "Hello World!\n\n" The fold operator with the keep indicator collapses the internal newline characters and indentation into single spaces, but keeps all the trailing white space.

Flow Style Sequences & Mappings (JSON-Like)

At the end of the previous instalment I teased that YAML provides an optional more compacted syntax for adding simple sequences and mappings. The example I used in the tease was this simply nested array:

---
- 1
- 
  - 2
  - 3

That YAML looks very clunky, especially when you compare it to the compact notation you’d use in a JSON file, which is simply [1,[2,3]].

This is where YAML’s optional flow style for sequences and mappings comes in. YAML has basically learned from JSON, so it supports a JSON-like syntax that’s actually much more forgiving than true JSON but will be instantly familiar to anyone with JSON experience. What makes it more flexible? In YAML, unless there’s some kind of ambiguity, neither key names nor strings need to be quoted. We can illustrate this with two simple YAML documents:

---
{Jim Kirk: To boldy go, Jean Luc Picard: Make it so!}
---
[this is a string, so is this!]

Converting these two documents to JSON we get the following dictionary and array:

{
  "Jim Kirk": "To boldy go",
  "Jean Luc Picard": "Make it so!"
}
[
  "this is a string",
  "so is this!"
]

The reason we could get away without any kind of quoting is that the keys in the mapping didn’t contain colons, and the strings didn’t contain commas or other characters with meaning in the the flow syntax like ] and }.

While you can avoid quoting surprisingly often, you can’t always avoid it. For example, the following does not work as expected:

---
{William Shakespeare: To be or not to be, that is the question}

Converting the above to JSON we do not get what we expect:

{
  "William Shakespeare": "To be or not to be",
  "that is the question": null
}

Why? Because the comma ends the value for the key William Shakespeare and starts a new key named that is the question with no value at all, which is interpreted as null.

The solution is of course to quote Shakespeare’s excerpt:

---
{William Shakespeare: 'To be or not to be, that is the question'}

This now converts to the expected JSON dictionary:

{
  "William Shakespeare": "To be or not to be, that is the question"
}

You can of course mix and match the regular and flow styles, which is in fact how you usually see the flow style used.

We can illustrate this with two YAML documents, one that represents a dictionary of arrays, and one that represents an array of dictionaries:

---
Monday: [Mon, mon, M]
Tuesday: [Tue, tue, Tu]
Wednesday: [Wed, wed, W]
Thursday: [Thur, thur, Th]
Friday: [Fri, fri, F]
Saturday: [Sat, sat, Sa]
Sunday: [Sun, sun, Su]
---
- {Name: Bob, Email: bob@burgers.com}
- {Name: Ken, Email: ken@barbie.com}

Converting these to JSON we get the following:

{
  "Monday": [
    "Mon",
    "mon",
    "M"
  ],
  "Tuesday": [
    "Tue",
    "tue",
    "Tu"
  ],
  "Wednesday": [
    "Wed",
    "wed",
    "W"
  ],
  "Thursday": [
    "Thur",
    "thur",
    "Th"
  ],
  "Friday": [
    "Fri",
    "fri",
    "F"
  ],
  "Saturday": [
    "Sat",
    "sat",
    "Sa"
  ],
  "Sunday": [
    "Sun",
    "sun",
    "Su"
  ]
}
[
  {
    "Name": "Bob",
    "Email": "bob@burgers.com"
  },
  {
    "Name": "Ken",
    "Email": "ken@barbie.com"
  }
]

Notice how much simpler the YAML is now!

Before we mixed and matched the the regular and flow styles of YAML in a sensible way our YAML sometimes came out more complicated looking than the equivalent JSON, but now that we can mix-and-match the two YAML syntax styles as desired, YAML really delivers on its promise of being more concise and more human readable & writeable!

Using yq to Query YAML Files

As mentioned in the previous instalment, I used the command line tool yq to convert the YAML snippets in the show notes to JSON.

This tool has some very lofty goals — the aim being to allow any of the common data formats to be queried using the jq syntax. As we produice this instalment (summer 2024), yq implements all the commonly used features of the jq language and supports the most popular data formats, including YAML, JSON, XML & CSV. Work is on-going to add better TOML support, and to add support for the remaining jq language features.

TOML is another config file language that feels like it might be on the cusp of a breakthrough. It stands for Tom’s Obvious Minimal Language, and it can best be described as a modern take on the old INI syntax used by Windows before the accursed Registry was added to replace config files:

# This is a TOML document

title = "TOML Example"

[owner]
name = "Tom Preston-Werner"
dob = 1979-05-27T07:32:00-08:00

[database]
enabled = true
ports = [ 8000, 8001, 8002 ]
data = [ ["delta", "phi"], [3.14] ]
temp_targets = { cpu = 79.5, case = 72.0 }

[servers]

[servers.alpha]
ip = "10.0.0.1"
role = "frontend"

[servers.beta]
ip = "10.0.0.2"
role = "backend"

One of the things that makes yq appealing is that it’s written from the ground up in the modern Rust programming language, so it benefits from the secure-by-design features built into Rust, and it has no dependencies! You can get all the details on their GitHub page, and on the Mac, you can install yq via Homebrew with the command:

brew install yq

To start experimenting with yq we’ll use a snapshot of the YAML config file for this site on GitHub. I used the command below to download it and save it into the installment ZIP as _config.yml:

curl https://raw.githubusercontent.com/bartificer/programming-by-stealth/4834603c3837c11b57bb641074f720c2cce6976a/docs/_config.yml -o _config.yml

Let’s start simple — like the jq command the yq command expects the first argument to be a jq filter, and the second argument to be a file to read from. If you pipe into into the command then you don’t pass a second argument, so the following two commands do the same thing:

# apply a jq filter to a YAML file
yq '.' some_file.yaml

# apply a jq filter to some YAML piped to STDIN
curl 'http://somesite.domain/someYAMLAPI?something=blah' | yq '.'

Because yq uses the jq language, the filters are literally the same as those we learned about in our exploration of jq (starting at instalment 155), so as a simple example, let’s extract the site title from our config file.

This is the entire file:

#theme: jekyll-theme-cayman
remote_theme: bartificer/bartificer-jekyll-theme
plugins:
  - jekyll-remote-theme
title: Programming by Stealth
email: podcasting@bartificer.net
description: >- # this means to ignore newlines until "baseurl:"
  A blog and podcast series by Bart Busschots & Allison Sheridan.
baseurl: "" # the subpath of your site, e.g. /blog
url: "" # the base hostname & protocol for your site, e.g. http://example.com
github_username:  bartificer

# theme-specific options
accent:
  color: "#00408d"
  # color_light: "#fbecd6"
  color_light: "#e6f1ff"
  font_family: '"Ubuntu Mono", mono'
nav_items:
  - url: https://bartb.ie/pbsindex
    icon: "fas fa-search"
    text: "PBS Index"
  - url: https://www.podfeet.com/blog/category/programming-by-stealth/
    icon: "fas fa-podcast"
    text: "Podcast Episodes"
community:
  url: https://podfeet.com/slack
  description: Find us in the PBS channel on the Podfeet Slack.
  icon: "fab fa-slack"
  labels:
    button: "Podfeet Slack"

The first thing to note is, as is normal in config files, the top-level element is a YAML mapping, or, more generically, a dictionary.

As a first example, let’s extract the site’s title, which is in the top-level key named title:

yq '.title' _config.yml

This outputs the site title in YAML format, which is just a bare string:

Programming by Stealth

This is more obvious if we ask for an array, say the text for each navigation item (the text keys in the dictionaries inside the top-level array nav_items):

yq '.nav_items | map(.text)' _config.yml

This outputs the following YAML sequence:

- "PBS Index"
- "Podcast Episodes"

As you can see, we can use the standard jq syntax we already know to query this YAML file. To underline this point one more time, the following command tells us how many Jekyll plugins the site needs (beyond those that GitHub Pages bundles by default) by chaining two jq filters together, one to get the plugins array from the top-level dictionary, and one containing just the length function:

yq '.plugins | length' _config.yml
# outputs 1

Crossing the Streams

Note that yq intends to be a cross-format tool, so, it’s perfectly happy querying a YAML file and outputting the result in JSON format! You can do this with the --output-format=json option or its less verbose short version -o=j. As an example, the following command queries the YAML config file for a list of all navigation items, and outputs the result in JSON format:

yq '.nav_items' -o=j _config.yml

This outputs the following JSON:

[
  {
    "url": "https://bartb.ie/pbsindex",
    "icon": "fas fa-search",
    "text": "PBS Index"
  },
  {
    "url": "https://www.podfeet.com/blog/category/programming-by-stealth/",
    "icon": "fas fa-podcast",
    "text": "Podcast Episodes"
  }
]

We can of course go the other way too, querying a JSON file and outputting YAML!

As an example, let’s use yq to query our Nobel Prizes JSON data set for the names of all laureates as a YAML array:

yq '[ .prizes[] | .laureates[]? | [.firstname, .surname?] | join(" ") ] | sort' -o=y NobelPrizes.json

This outputs a big long list of laureates starting with:

- A. Michael Spence
- Aage N. Bohr
- Aaron Ciechanover
- Aaron Klug
- Abdulrazak Gurnah
- Abdus Salam
- Abhijit Banerjee
- Abiy Ahmed Ali
- Ada E. Yonath
- Adam G. Riess

We are of course not limited to converting JSON to YAML, we can also work with CSVs this way, for example, we can output the list of navigation items from our config file as CSV with the command:

yq '.nav_items' _config.yml -o=csv

This produces the following CSV:

url,icon,text
https://bartb.ie/pbsindex,fas fa-search,PBS Index
https://www.podfeet.com/blog/category/programming-by-stealth/,fas fa-podcast,Podcast Episodes

And Much More …

This is just a glimpse of what yq can do. Unfortunately, yq does not come with a traditional POSIX man page, but you can get a similar listing of all the supported arguments, options & flags with the command:

yq help

There’s also an official documentation site that contains deeper explanations and more examples, but I’ve found that it’s sometimes a little behind in terms of the full list of supported features and options — that is to say, the actual command can do a little bit more than the current official documentation site says it can!

Final Thoughts

At the very least I hope you’ll now recognise YAML when you come across it in a config file or some front matter, and that you’ll now be able edit these files with newfound confidence. At best, I hope you’ll give serious consideration to using YAML as your configuration file format on any new software projects you embark on!

We’ll be shifting gears quite dramatically again for the next two instalments, starting with an exploration of what the Model View Controller, or MVC software design pattern is, the problems it solves, and why it was such a good fit for the new XKPasswd web interface being developed by the community. That will be followed by another guest appearance from the leading light in the XKPasswd re-implementation, Helma van der Linden!

Join the Community

Find us in the PBS channel on the Podfeet Slack.

Podfeet Slack