PBS 168 of X: An Introduction (YAML)

22 Jun 2024

At first glance it might be surprising to see this series take a short diversion into a language we’ve never mentioned before called YAML, why are we doing that?

The driving inspiration for this series for the past few years has been supporting the re-design of XKpasswd by the community, and that has resulted in a substantial shift in perspective. We started the series many years ago with a tight focus on programming, but with the XKPasswd project came a big expansion in our focus. We’ve widened our scope from just programming to software development generally. What’s the difference? Programmers just write code, while developers build and ship software. Building and shipping software requires a huge array of developer tools, and that’s where our focus has been for the past few years.

(Note there is another even wider field we’re continuing to ignore for now — software engineering!)

We started with more obviously developer-focused tools like the Node Package Manager (PBS 127) & JavaScript Modules (128), linters (ES Lint, 129), API documentation generators (JSDoc, 130+), test suites (Jest, 135 & 136), bundlers (Webpack, 137+). But then we broadened out into more generally useful tools that are also important for efficient software development, starting with programmatic diagramming with Mermaid (PBS 141), then shell scripting with Bash (143+), and finally, jq for processing JSON data (155+) from the terminal and in shell scripts.

The connection to YAML is a little more tenuous, but only a little. Like JSON, YAML is now a very widely adopted data format, and as best as I can tell, it’s surpassing JSON as the go-to language new projects are using for their configuration files. YAML has also made a home for itself in content creation with it’s widespread use as a metadata format for Markdown content.

For now, none of the config files in the XKPasswd Git repo are in YAML, but I doubt that will remain the case. Also, this series already has a direct, if somewhat meta, intersection with YAML — these notes are published using GitHub Pages which is implemented with the open source Jekyll static site generator, and Jekyll that makes very heavy use of YAML for configuration, data storage, and metadata.

Ultimately I think there are three good reasons to dedicate two instalments in this series to YAML:

I have a gut feeling it’s going to become a must-have skill for developers — as soon as I saw my first JSON file I just knew JSON was going to become a big deal, and the first time I understood what YAML is and how it works, I had the exact same feeling
I’m pretty sure we’re going to be using at least one developer tool that leverages YAML before the XKPasswd project is out
I really like it, and I want to share my love of it with this audience 🙂

This is going to be a two-part mini-series, starting with why there was a need for something like YAML to be developed, YAML’s philosophy, and the basic syntax. Then, in the second instalment we’ll look at some more advanced aspects of YAML’s syntax, and we’ll explore yq a very cool command line tool inspired by jq which can process YAML files using the jq query language!

Matching Podcast Episode

Listen along to this instalment on episode 168 of the Chit Chat Across the Pond Podcast.

You can also Download the MP3

Read an unedited, auto-generated transcript with chapter marks: CCATP_2024_06_22

The Big Picture

YAML is a language for storing data in a simple, human-readable form that can also be reliably processed by computers. It reminds me a lot of Markdown which is designed to be human readable and obvious as-is, but can also be read by computers and converted into other formats like HTML for presentation. YAML wants to be the Markdown for data!

To give you a taste, the following is a snippet of the YAML configuration file for this site:

remote_theme: bartificer/bartificer-jekyll-theme
title: Programming by Stealth
email: podcasting@bartificer.net
description: A blog and podcast series by Bart Busschots & Allison Sheridan.
baseurl: "" # the subpath of your site, e.g. /blog
url: ""
github_username:  bartificer

As we’ll learn later, YAML was intentionally designed to be very generic and to support all kinds of use cases, but I see its use really taking off in two areas.

Firstly, it’s becoming very common to use YAML for configuration files. Because it’s so human-friendly, you can just give people a template config to fill out and they’ll get it right without even realising it is YAML. It’s much less fiddly than JSON, which won’t parse if you make even a small mistake, and it’s infinitely more readable and usable than XML.

Secondly, so-called front matter has become the defacto technique for embedding metadata within Markdown files, and those front matter blocks are in fact YAML.

As mentioned already, this site is built using the Jekyll static site generator. The actual contents of the site is contained in the docs folder of the PBS Git repo, and in there you’ll find a YAML configuration file (_config.yml), some other resources, and lots and lots of Markdown files, one for each instalment, some of which have YAML front matter.

Finally, I’ve maintained a personal knowledge base for years. It’s a collection of notes describing various things I learn so I can refer back to them later. It started off as an EverNote library, then as their app lost focus and became worse and worse bloatware. I briefly migrated to Joplin, largely because it’s open source, and stores text as Markdown, but that proved a dead-end because their Mac app suffers badly from by nerds-for-nerdy syndrome, and I just couldn’t handle spending so much time in such a poor UI!

Finally, I’ve landed on another open source option that lets me bring my own back-end — Obsidian back-ended by a private GitHub repo! In effect, Obsidian is a private static site generator — what you see in the UI is a well-organised collection of notes enriched with lots of metadata, but what you actually have in that Git repo is nothing more than a folder of Markdown files with YAML front matter! This means that if I ever get fed up with Obsidian, my notes and their metadata are safe — they’re just completely human-readable files in a Git repo I control, so I can’t lose my work!

Some Historical Context

Before we go any further I want to set the scene so we a see YAML in its larger historical context. YAML didn’t spontaneously emerge from a vacuum — it was built to provide a better alternative to what had gone before. If you think about it, the problem YAML solves, saving data in text files, is a problem we’ve been solving since the very earliest days of computing.

If we cast our minds back a few decades to the earliest computers we’d recognise as computers we find primitive data formats like fixed-width text files are all the rage, e.g.:

Timestamp           Severity Message
2023-12-25T23:30:03 INFO     Carrot & Cookie placed on mantlepiece
2023-12-25T23:45:42 NOTICE   Vibration detected by roof sensor
2023-12-25T23:50:32 NOTICE   Motion detected by chimney sensor     

Fixed-width files are not so much a format as an approach. In effect, every fixed-width text file is its own format — the way it works is every line a text file is a collection of fields, and everyone agrees up-front what fields each line will contain, and exactly how many characters wide each field will be. If your data is shorter than the agreed field width you must pad your data with spaces.

For the above example, there would need to be documentation somewhere to record the key points of the format, i.e. that each line must contain the following (in order):

A timestamp in ISO8601 format (exactly 19 characters wide)
A space
A syslog severity in an 8-character-wide field
A space
A log message taking up the rest of the line

These kinds of formats are very human-readable — as you scroll everything lines up nicely — but, they’re also spectacularly brittle! One line of code somewhere in the code base that doesn’t validate the length of a piece of data and the entire record is corrupt! Not to mention how difficult it is to change your mind about the structure of the data once code has been written — extending or adding a field requires a careful review of every single line of code that processes the files to make sure all the offsets have been adjusted as needed.

In my entire professional life, I’ve encountered just one production system using fixed-width text files for data import, and it was just as much of a pain as you’d expect! We did need to alter the structure, and it was definitely not a smooth process. We had obscure bugs crawling out of the woodwork for years after 🙁

As well as being brittle these files are also limited — they can only represent tabular data containing strings and numbers — they can’t represent lists of dictionaries.

Because fixed-width fields are so difficult to work with they were soon followed by Tab-Separated Value (TSV) files, and shortly after that, Comma-Separated Value files (CSV) files.

A TSV file (also called a tab file) looks superficially similar to a fixed-width text file, but instead of counting characters, the fields are separated by tab characters:

Timestamp	Severity	Message
2023-12-25T23:30:03	INFO	Carrot & Cookie placed on mantlepiece
2023-12-25T23:45:42	NOTICE	Vibration detected by roof sensor
2023-12-25T23:50:32	NOTICE	Motion detected by chimney sensor

TSV files are less brittle because you don’t need to count character offsets anymore, but you now have two new problems:

Your data can’t contain tabs
Humans can’t see the difference between tabs and spaces easily, so debugging corrupted TSV files can be infuriating!

Comma-Separated Value (CSV) files improve on TSV files a little by making two changes:

Fields are separated with commas, not tabs — this makes it easier for humans to debug
The specification includes support for quoting field data so it can include commas and even escaped quotation marks

This gives us what remains to this very day, a widely used generic text format for sharing tabular data:

Timestamp,Severity,Message
2023-12-25T23:30:03,INFO,"Carrot & Cookie placed on mantlepiece"
2023-12-25T23:45:42,NOTICE,"Vibration detected by roof sensor"
2023-12-25T23:50:32,NOTICE,"Motion detected by chimney sensor"

However, note that both TSVs and CSVs have sacrificed readability for robustness — because the field boundaries no longer line up visually, so you need to count separators to know which column you’re reading. And, like fixed-width fields, TSVs and CSVs are limited to tabular data.

The next text-based data format to really take off was the Extensible Markup Language, or XML.

Because of its enthusiastic adoption by major enterprise vendors like Sun Microsystems (Java), Oracle & Microsoft (IIS), XML is still doing a lot of heavy lifting in large enterprises to this day. With my work hat on I still see XML used in lots of config files, for serialising and deserialising data to and from disk, and for moving data around with XML-based message passing and remote procedure calls (SOAP, XMLRPC, etc.).

XML is not built around the idea of lines with separated fields, instead, it uses nested tags. This means you can represent just about any data structure you like in XML. And, with some sane tag names, sensible indentation, and some syntax highlighting, it can be quite human-readable:

<message>
  <timestamp>2023-12-25T23:30:03</timestamp>
  <severity>INFO</severity>
  <message>Carrot &amp; Cookie placed on mantlepiece</message>
</message>
  <timestamp>2023-12-25T23:45:42</timestamp>
  <severity>NOTICE</severity>
  <message>Vibration detected by roof sensor</message>
<message>
  <timestamp>2023-12-25T23:50:32</timestamp>
  <severity>NOTICE</severity>
  <message>Motion detected by chimney sensor</message>
</message>

So, while XML is powerful and pretty human-readable, it’s also extremely verbose! The power and the human-readableness come at the price of wasted disk space and/or bandwidth.

The next contender to really make a splash was JSON, which we’ve already made heavy use of throughout this series, so you’re already intimately familiar with it.

Like XML, JSON can represent complex data types and is very human-readable. But it does so while being much less verbose, and hence, less inefficient:

[
  {
    "timestamp": "2023-12-25T23:30:03",
    "severity": "INFO",
    "message": "Carrot & Cookie placed on mantlepiece"
  },
  {
    "timestamp": "2023-12-25T23:45:42",
    "severity": "NOTICE",
    "message": "Vibration detected by roof sensor"
  },
  {
    "timestamp": "2023-12-25T23:50:32",
    "severity": "NOTICE",
    "message": "Motion detected by chimney sensor"
  }
]

Finally, we get to YAML!

YAML builds on JSON but takes things just that little bit further by being even more human-readable, and also more human-writable, all while also being even less verbose:

- timestamp: 2023-12-25T23:30:03
  severity: INFO
  message: Carrot & Cookie placed on mantlepiece
- timestamp: 2023-12-25T23:45:42
  severity: NOTICE
  message: Vibration detected by roof sensor
- timestamp: 2023-12-25T23:50:32
  severity: NOTICE
  message: Motion detected by chimney sensor

YAML’s Philosophy

OK, with all that groundwork laid, let’s start our exploration of YAML itself.

As I said in the introduction, YAML was designed to solve many problems, not just those of interest to us in this mini-series. Quoting from the specification (emphasis added):

YAML (a recursive acronym for “YAML Ain’t Markup Language”) is a data serialization language designed to be human-friendly and work well with modern programming languages for common everyday tasks.

…

There are hundreds of different languages for programming, but only a handful of languages for storing and transferring data. Even though its potential is virtually boundless, YAML was specifically created to work well for common use cases such as: configuration files, log files, interprocess messaging, cross-language data sharing, object persistence and debugging of complex data structures. When data is easy to view and understand, programming becomes a simpler task.

That last sentence really speaks to me — the YAML designers world view clearly aligns nicely with mine 🙂

(Note: in its very earliest days YAML was a humours acronym — ‘Yet Another Markup Language’)

Terminology

Firstly, to make YAML as flexible as possible, it was designed from the ground up to be embeddable within other file formats, and, to allow multiple distinct pieces of YAML data within single files and data streams. The YAML specification refers to each of these separate chunks of YAML data as Documents. So, to put it another way, YAML documents can be embedded within any file, and you can embed as many YAML documents as you like.

Secondly, because YAML was designed to be very generic, it provides single representations for data types that have many synonyms in other languages. This means that for each core concept, the YAML designers had to choose one official synonym for use in the YAML spec.

This is how the YAML spec represents data:

Documents — one or more pieces of data
- Scalars — single values like strings, numbers, booleans, etc.
- Collections — compound values that come in two flavours:
  - Sequences — lists of values, AKA arrays, vectors, linked lists, etc.
  - Mappings — key-value pairs, AKA dictionaries, hash tables, lookup tables, etc.

YAML Versions

The most recent major version of the YAML specification is YAML 1.2 which was released in 2009, replacing YAML 1.1 which dated back to 2005. The most recent revision is YAML 1.2.2 which was released in 2021, and that’s the spec I’ve used as my reference for these notes.

When writing new YAML documents, my advice is to always use the newest version of the spec, but it can sometimes be useful to know about aspects of the older spec when reading YAML snippets online. The most notable example is the changes to Booleans between the 1.* and 2.* specs which we’ll get to shortly.

Visualising our YAML Examples (with the help of `yq`)

Before we dive into how YAML works, we’re going to need a way of illustrating what data a piece of YAML code represents. What we need is a translation between the YAML snippets we don’t yet understand, and some data format we’re already intimately familiar with. Give our journey in this series, JSON seemed the most suitable option.

So, throughout our exploration of YAML I’ll be using YAML → JSON conversions to illustrate the various concepts. I performed these conversions using a very interesting new command line tool named yq which is heavily inspired by jq, but can read and output information from multiple data formats, including both JSON & YAML.

I’ve already performed all the needed conversions and included the resulting JSON directly into the show notes, so you don’t need to install yq, but if you’d like to, you can learn all about it on their GitHub page. Also, if you use a Mac, you can install it via Homebrew (like we did jq) with the command:

brew install yq

If you’re curious how I did the conversions without needing to save the YAML snippets into files first, I used the Mac terminal command pbpaste to send the contents of my clipboard into yq. So, my process was to simply copy the YAML snipped from the notes, then run either of the following terminal commands:

# for long output
pbpaste | yq -o=j

# for compact output
pbpaste | yq -o=j -I=0

Note: There are actually two tools with the name yq, the other one is a crude Python-based wrapper around jq that converts YAML to JSON using Python’s native libraries, and then passes that JSON to the jq command. The big draw-back here is that you need to install both Python & jq to use this tool. But, if you’re already using both Python & jq you may find it useful — you can get the details on their their GitHub page.

Basic YAML Syntax

In this first instalment we’re going to start with the basics, then we’ll add a few extra nuances in the next instalment.

Starting & Ending Documents

In YAML so-called structures are used to signal the start and/or end of a YAML document to a YAML interpreter.

There are just two structures, and both have to appear by themselves on a single line:

YAML Structure	Description
`---`	Start a new YAML document, if there is already one started, start a new one.
`...`	End a YAML document without starting a new one.

This means that, according to the spec, you should start your embedded YAML with ---, and end it with ....

As an interesting side-note — you might notice that the norm for adding front matter to Markdown files in many Markdown-based static site generators appears to violate the standard — the front matter has to be at the top of the document and it must be enclosed between --- structures. Both Jekyll and Obsidian use this approach. The starting triple-dashes are as expected, but should the ending not be triple-dots? Actually, no! As we’ll discover next time, a single string spread over multiple lines with no indentation is in fact a valid YAML document, so when a Markdown file with front matter is passed through a YAML parse what it finds is actually two YAML documents, a mapping with metadata, and a big long string!

YAML Comments

Unlike JSON, YAML does support comments, but unlike most proper programming languages, it doesn’t offer coders a choice of comment syntaxes. YAML supports one commenting syntax — Bash-style # comments:

---
# A YAML document containing just this comment
---
# Another YAML document containing a string
a string # a trailing comment after the string

Indentation is Not Optional!

YAML uses indentation to represent scope — that is to say, things indented to the same level are part of the same collection or multi-line value.

At first glance, you might be worried that this could cause all sorts of weird bugs when you copy and paste between multiple documents — what if one uses tabs and the other spaces, but they look visually the same? YAML has your back thanks to a very sensible design choice YAML forbids the use of tabs for indenting!

Simple Scalars in YAML

Let’s start with the simplest scalar value — the non-value value, AKA null. YAML supports three representations for the null value — the keywords null, Null & NULL, the character ~, or absolute emptiness, that is to say, no characters whatsoever.

The next simplest scalar is the boolean, which YAML supports in a few ways:

Boolean Value	YAML Syntax
True	The keywords `true`, `True` & `TRUE` (YAML 1.1 also allowed `Yes` & `On`)
False	The keywords `false`, `False` & `FALSE` (YAML 1.1 also allowed `No` & `Off`)

Finally, let’s look at how YAML represents numbers:

Number Type	YAML Syntax
Integers	`1234` or `-1234`
Octal numbers	Pre-fixed with `0o`, e.g. `0o42` is `34` (YAML 1.1 prefixed with just `0`, causing confusion where `042` and `42` are not the same thing at all!)
Hexadecimal Numbers	Prefixed with `0x`, e.g. `0x4F` is `79`
Decimal Numbers	`123.45`, `-123.45`, or even `12.3e+5` for scientific notation
Infinity	Either `.inf` for positive infinity, or `-.inf` for negative infinity
Not-a-Number	The numeric value to represent the fact that the desired value can’t be represented as a number is `.NAN`.

Basic Strings in YAML

The first thing to note is that all strings in YAML are Unicode, so you can use all the accented characters and emoji you like 🙂

Also note that that while we’re going to confine ourselves to strings defined on a single-line in this instalment, YAML has extensive and very nuanced support for multi-line strings, but those are for the next instalment.

String Quoting (`''` or `""`)

The third thing to note, is that unless there’s some kind of contextual ambiguity, string quoting is optional in YAML.

In situations where you do need to quote your strings to avoid ambiguity, you need to choose between single and double quotes, and your choice is not purely aesthetic, they have different meanings! In YAML, single quoted strings don’t support escape sequences, while double quoted strings require escape sequences.

In other words, the YAML string 'Hello\nWorld' literally represents the word Hello followed by a backslash symbol, followed by the letter n followed by the word World. On the other hand, the YAML string "Hello\nWorld" represents the word Hello followed by a new line character, followed by the word World.

Collections

Now that we can represent single values, let’s look at the two types of collections YAML provides.

Sequences (Arrays/Lists)

To define what we would think of as an array in JavaScript, JSON, or jq, the basic YAML syntax is to start each entry in the array with a dash followed by a space, so we can define a simple array of mixed values with:

---
- True
- 42
- -3.1415
- Hello World!

Running this through yq we get the following simple JSON array:

[true,42,-3.1415,"Hello World!"]

Mappings (Dictionaries/Objects/Hashtables)

To define what we have been referring to as dictionaries in this series, the syntax is simply keys-value pairs separated by colons :. For example, the following YAML document lists sales data by category:

---
Food: 365
Drinks: 432
Confectionary: 98

Running this through yq we get the following simple JSON dictionary:

{
  "Food": 365,
  "Drinks": 432,
  "Confectionary": 98
}

Nesting (Just Indent!)

Both sequences and mappings can contain sequences, mappings, and scalars, and you simply use indentation to define which values belong to which sequences or mappings.

As a first example, let’s look at the entire Jekyll configuration file for this website:

remote_theme: bartificer/bartificer-jekyll-theme
plugins:
  - jekyll-remote-theme
title: Programming by Stealth
email: podcasting@bartificer.net
description: A blog and podcast series by Bart Busschots & Allison Sheridan.
baseurl: "" # the subpath of your site, e.g. /blog
url: "" # the base hostname & protocol for your site, e.g. http://example.com
github_username:  bartificer

# theme-specific options
accent:
  color: "#00408d"
  # color_light: "#fbecd6"
  color_light: "#e6f1ff"
  font_family: '"Ubuntu Mono", mono'
nav_items:
  - url: https://bartb.ie/pbsindex
    icon: "fas fa-search"
    text: "PBS Index"
  - url: https://www.podfeet.com/blog/category/programming-by-stealth/
    icon: "fas fa-podcast"
    text: "Podcast Episodes"
community:
  url: https://podfeet.com/slack
  description: Find us in the PBS channel on the Podfeet Slack.
  icon: "fab fa-slack"
  labels:
    button: "Podfeet Slack"

The first thing to notice is that at the top level, this entire config is one big mapping, i.e. a dictionary. This is the normal structure for YAML config files.

Most of the keys in our mapping map directly to scalars, for example the remote_theme key maps to the string bartificer/bartificer-jekyll-theme.

We also have nested mappings, the deepest one being three deep:

community:
  url: https://podfeet.com/slack
  description: Find us in the PBS channel on the Podfeet Slack.
  icon: "fab fa-slack"
  labels:
    button: "Podfeet Slack"

The community key in the top-level mapping maps to another mapping with four keys — url, description, icon & labels. The community.labels key maps to yet another mapping, but this one has just one key, button.

This is what this section of the YAML file would look like as JSON:

{
  // …
  "community": {
    "url": "https://podfeet.com/slack",
    "description": "Find us in the PBS channel on the Podfeet Slack.",
    "icon": "fab fa-slack",
    "labels": {
      "button": "Podfeet Slack"
    }
  }
  // …
}

We also have an example of a mapping containing an array of mappings:

nav_items:
  - url: https://bartb.ie/pbsindex
    icon: "fas fa-search"
    text: "PBS Index"
  - url: https://www.podfeet.com/blog/category/programming-by-stealth/
    icon: "fas fa-podcast"
    text: "Podcast Episodes"

The top-level nav-items key maps to an array with two elements, each of which is a mapping with three keys — url, icon & text.

Again, this is what this section of the YAML file would look like as JSON:

{
  // …
  "nav_items": [
    {
      "url": "https://bartb.ie/pbsindex",
      "icon": "fas fa-search",
      "text": "PBS Index"
    },
    {
      "url": "https://www.podfeet.com/blog/category/programming-by-stealth/",
      "icon": "fas fa-podcast",
      "text": "Podcast Episodes"
    }
  ]
  // …
}

Since this is our first complete real-world example I’ll point out two more things:

The use of comments both to add information and to comment out an old value for the accent.color_light key
The needless over-use of quoted strings due to my naiveté when creating this config many years ago. I was writing YAML without even realising it was YAML. I was just copying from a sample, proving that YAML really is very user-friendly 🙂

Finally, I want to draw your attention to something so you can put a mental pin in it for the next instalment — the standard YAML notation for nested arrays of simple values is clunky!

This is the YAML to define a simple nested array where the top-level array contains one number and one array, and the second-level array contains two numbers:

---
- 1
- 
  - 2
  - 3

You could argue the above YAML is actually less clunky than the normally indented JSON:

[
  1,
  [
    2,
    3
  ]
]

However, we all know we’d never choose that JSON indentation style for something so simple, we’d use the compact notation:

[1,[2,3]]

Clearly, there’s room for some syntactic sugar, i.e. some kind of optional additional syntax for better representing simple nested data structure. That’s the perfect teaser to end this instalment on!

Final Thoughts

Hopefully, this has given you a taste for YAML, and a good grounding in the basics.

We’ll finish our short exploration of YAML in the next instalment when we look at a few advanced topics, and we’ll explore the basics of the jq-like yq command, and use it to process some YAML files.

No Previous (on First) YAML Miniseries Advanced Topics

167. Recursion, Syntactic Sugar, Some old Friends and a Few Honourable Mentions Programming by Stealth 169. Advanced Topics

Join the Community

Find us in the PBS channel on the Podfeet Slack.

Podfeet Slack