Kostas Kalafatis

Posted on Sep 9, 2023 • Edited on Sep 13, 2023

A Gentle Introduction to the YAML format

#beginners #tutorial #devops #github

What is YAML?

YAML is a data-serialization language that is human-friendly. YAML's main goals are simplicity and readability. You can think of YAML as JSON without the ugly parts. Because of its ease of use, YAML is being adopted for storing configuration files and other types of structured data that is meant to be edited by hand.

Basic Rules

First of all, let's discuss some basic rules for working with YAML files:

Whitespace and indentation matter and play an important part in structuring the data, so take special care to stay consistent.
Never use tabs for indentation. YAML does not allow tab characters in indentation; only spaces may be used.
Although the number of spaces doesn't matter, as long as a child node is indented further than its parent, it is good practice to keep the same number of spaces per level.
YAML is case sensitive.
YAML's comments start with a # and go until the end of the line.
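
A small sketch of these rules in one file. The keys are made up for illustration and do not come from any real configuration:

```yaml
# Keys are case sensitive, so these are two different keys
name: lower-case key
Name: capitalized key

# Children are indented with spaces, using the same width at every level
server:
  host: localhost
  ports:
    http: 80
    https: 443
```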

With that out of the way, let's see the basic YAML syntax.

Indentation

YAML uses indentation to indicate the structure and hierarchy of the data, in a manner reminiscent of Python's indentation rules.

The recommended indentation for YAML files is two spaces per level, but YAML can also follow any indentation system that the individual file uses. However, you should be consistent and avoid mixing different indentation styles in the same file.

Building Microservices:
    author: Sam Newman
    language: English
    publication-year: 2021
    pages: 586
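
For comparison, here is the same mapping written with the recommended two spaces per level. Both versions describe identical data; only the indentation width differs.

```yaml
Building Microservices:
  author: Sam Newman
  language: English
  publication-year: 2021
  pages: 586
```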

Comments

YAML only supports single-line comments. A comment starts with a # and runs to the end of the line. You can put it at the beginning of a line or after a value, as long as the # is preceded by whitespace.

# This is a single line comment
foo: bar # this is an inline comment
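
One detail worth a sketch: a # only starts a comment when it sits at the start of a line or follows whitespace. The keys and the URL below are illustrative.

```yaml
# The first # below has no whitespace before it, so it is part of the value
url: http://example.com/#section   # this second # starts a comment
# A # inside quotes is always plain text
title: "Issue #42 is fixed"
```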

Scalars

Scalars are a pretty basic concept. They are the most basic of all data types and are simple string, integer, float and boolean values.

Integer scalars are numeric values that represent whole numbers, such as 1, 42, -7, etc. Integer scalars do not require quotes, and are typically treated as numeric types.

count: 3 # This is an integer scalar

Float scalars are numeric values that represent fractional or decimal numbers, such as 3.14, -0.5 etc. Float scalars do not require quotes, and are typically treated as numeric types.

# This is a float scalar
pi: 3.14 

# This is also a float scalar
area: 19.625

# And yet another scalar, in scientific notation
mass: 1.67e-27

Bool scalars are boolean values that represent true or false, such as true, false, yes, no, on and off. Bool scalars do not require quotes, and are typically treated as logical values.

# These are true/false boolean scalars
active: true
enabled: false

# These are yes/no boolean scalars
email-consent: yes
sms-consent: no

# These are on/off boolean scalars
switch: on
light: off

You should be aware that different versions of YAML have different rules for interpreting bool scalars. For example, in YAML 1.1, yes, no, on and off are valid bool scalars, but in YAML 1.2 they are not, and a YAML 1.2 parser reads them as plain strings. For that reason, it is recommended to use true and false for bool scalars to avoid confusion and compatibility issues.
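
The reverse problem also exists: a value that is meant to be a string can be read as a boolean by a YAML 1.1 parser. A minimal sketch of the usual defence, which is to quote such values (the keys are hypothetical):

```yaml
# Quote values that merely look like booleans when a string is intended
country-code: "no"     # always the string "no", for example Norway
answer: "yes"          # always the string "yes"

# true and false are read as booleans by both YAML 1.1 and YAML 1.2 parsers
newsletter: true
```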

Strings can be written in different ways, depending on the syntax and style we are using. There are two main types of formats that YAML supports for strings.

Block String Scalars

Block scalars are a way of representing strings that can span multiple lines. They start with a pipe character (|) or a right angle bracket (>) after the key, and the string content follows on the next lines, indented further than the key. The indentation shared by all content lines is stripped from the result; any extra indentation beyond it is preserved. Block scalars are useful for long or complex strings and for strings that contain newline characters.

Literal blocks are introduced by a pipe character (|), with the string content starting on the following line. The content is not folded, so every line break is preserved in the output.

key1: |
    This is a
    literal string
    with line breaks
    and spaces.

If we convert this to JSON we will get the following:

{
    "key1": "This is a\nliteral string\nwith line breaks\nand spaces.\n"
}

Folded blocks are introduced by a right angle bracket (>), with the string content starting on the following line. The content is folded: each single line break between non-empty lines is replaced with a space, while blank lines and more-indented lines keep their line breaks.

key1: >
  This is a folded block.
  It can span multiple lines, but newlines
  will be replaced with spaces.

If we convert this to JSON we will get the following:

{
    "key1": "This is a folded block. It can span multiple lines, but newlines will be replaced with spaces.\n"
}
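
By default both block styles keep one trailing newline at the end of the value. The YAML specification lets you change this with a chomping indicator placed right after the | or >. A short sketch, with illustrative key names:

```yaml
# |- strips the final line break, so the value is "no trailing newline"
stripped: |-
  no trailing newline

# >- folds the lines and strips the final line break: "folded and stripped"
folded-stripped: >-
  folded and
  stripped

# |+ keeps every trailing line break, including blank lines after the text
kept: |+
  trailing newlines kept
```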

Flow String Scalars

Flow scalars are strings written inline, next to their key. There are three styles: plain (unquoted), single-quoted and double-quoted. They are more compact than block scalars, and although they can span multiple lines, the line breaks are folded into spaces and the indentation of continuation lines is ignored. Only the double-quoted style supports escape sequences.

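Plain flow scalars are the simplest type. They use no quoting characters and no escape sequences, so the content is taken as written. The trade-off is that a plain scalar cannot contain character sequences that YAML treats as syntax, such as a colon followed by a space or a space followed by a #.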

key: This is a plain flow scalar.

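Single-quoted flow scalars use single quotes (') to quote the string content. Everything inside the quotes is taken literally, so special characters such as : or # lose their meaning and backslashes are not escapes. The only escape is a doubled single quote (''), which stands for one single quote.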

key: 'This is a single-quoted string with ''single quotes''.'

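Double-quoted flow scalars use double quotes (") to quote the string content. This is the only style that supports backslash escape sequences, such as \n for a newline and \\ for a backslash. Like the other flow scalars, a double-quoted scalar can span several lines, and the line breaks are folded into spaces.

key: "This is a
  double-quoted flow scalar
  that becomes a
  single line."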
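
To make the difference between the three flow styles concrete, here is the same text in each of them. The key names are illustrative.

```yaml
# Double quotes: \n and \t are turned into a real newline and a real tab
double: "first line\nsecond line\tafter a tab"

# Single quotes: the same characters stay as a backslash followed by a letter
single: 'first line\nsecond line\tafter a tab'

# Plain scalars do not process escapes either
plain: first line\nsecond line
```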

Sequences

Sequences are collections of items that are ordered and indexed. The items in a sequence can be of any type, including strings, numbers, mappings or other sequences. There are two main ways of writing sequences supported by YAML.

Block Sequences

Block sequence format uses a dash followed by a space (- ) to indicate each item in the sequence. The items are usually indented under the parent node, and they can span multiple lines. Block sequences are more readable and flexible than flow sequences, but they take up more space.

# A block sequence of strings
fruits:
  - apple
  - banana
  - cherry

# A block sequence of dictionaries
users:
  - name: Alice
    age: 35
    hobbies:
      - reading
      - writing
      - horseback riding
  - name: Bob
    age: 33
    hobbies:
      - coding
      - gaming
      - miniature painting

Flow Sequences

Flow sequences use square brackets ([]) to enclose the items in the sequence. The items are separated by commas (,) and can be written on the same line as the parent node. Flow sequences are more compact than block sequences, but they become hard to read when the items are long or deeply nested.

# A flow sequence of strings
fruits: [apple, banana, cherry]

# A flow sequence of dictionaries
users: [{name: Alice, age: 35, hobbies: [reading, writing, horseback riding]}, {name: Bob, age: 33, hobbies: [coding, gaming, miniature painting]}]
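
Flow sequences do not have to fit on one line. As a sketch, the same users list can be broken across lines, which stays compact but is easier to scan:

```yaml
users:
  [
    {name: Alice, age: 35, hobbies: [reading, writing, horseback riding]},
    {name: Bob, age: 33, hobbies: [coding, gaming, miniature painting]}
  ]
```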

Mappings

A YAML mapping is a collection of key-value pairs where each key is associated with a value. Mappings are similar to dictionaries, objects or associative arrays. In a mapping, each key must be unique and maps to exactly one value, although that value can itself be a sequence or another mapping. The YAML specification treats mappings as unordered, so do not rely on the order of the keys.

map:
  key1: value1
  key2: value2
  key3: value3
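
When a key needs to hold several values, the usual approach is to make its single value a collection. A minimal sketch (the formats key and its values are hypothetical):

```yaml
# Each key appears once; a key that needs several values holds a collection
book:
  title: Building Microservices
  authors:                      # one key, one value: a block sequence
    - Sam Newman
  formats: [paperback, ebook]   # one key, one value: a flow sequence
```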

Block Mappings

Block mappings use indentation and colons (:) to represent key-value pairs, without explicit delimiters such as curly brackets. Block mappings are more human readable, but also more verbose.

Block mappings can contain nested block mappings by increasing the level of indentation for nested key-value pairs.

user:
  username: cinnamon
  name: John
  surname: Doe
  email: cinnamonroll@example.com
  billing-address:
    street: Some Street
    number: 32
    zip-code: 17288

Flow Mappings

Flow mappings use curly brackets ({}) to enclose the data. The key and value are separated by a colon (:) and each key-value pair within a mapping is separated by a comma (,). Flow mappings are often used when brevity is important.

user: {username: cinnamon, name: John, surname: Doe, email: cinnamonroll@example.com, billing-address: {street: Some Street, number: 32, zip-code: 17288}}
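
Block and flow styles can also be mixed. One common pattern, sketched here with the same data, is block style for the outer levels and flow style for a small leaf mapping:

```yaml
user:
  username: cinnamon
  name: John
  surname: Doe
  email: cinnamonroll@example.com
  billing-address: {street: Some Street, number: 32, zip-code: 17288}
```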

Below is an example YAML file with all the elements we have discussed so far.

# SCALAR TYPES

# Our root object will be a map
key: value
another-key: another string value that goes on and on
a-number-value: 100
a-number-in-scientific-notation: 1e+12
a-hex-value: 0x123 # this will evaluate to 291
an-octal-value: 0123 # evaluates to 83 under YAML 1.1; YAML 1.2 writes octal as 0o123

# And now some boolean values
booleanTrue: true
booleanFalse: false

yesValue: yes # evaluates to true under YAML 1.1; a plain string under YAML 1.2
noValue: no # evaluates to false under YAML 1.1; a plain string under YAML 1.2

# Strings don't need to be quoted but they can be
a-simple-string: Does not require quotes
single-quotes: 'have ''one'' escape pattern'
double-quotes: "have many \", \0, \t, \u223A, \r escape patterns"

# Unicode characters can be written directly, or as escapes inside double quotes
utf-superscript: "\u00B2"

# Special characters must be enclosed in quotes
special-characters: "[ John ] & { Jane } - <Doe>"

# Multiple-line strings can be written as a literal block
literal_block: |
  This entire block of text will have
  its value preserved
  with line breaks being preserved.

  The literal continues until de-dented, and the leading indentation is
  stripped.

# Or in a folded block
folded_style: >
  This entire block of text will have its values preserved with line
  breaks being converted into spaces.

  Blank lines, like above, are converted to a newline character.

      'More-indented' lines keep their newlines, too -
      this text will appear over two lines.

# COLLECTION TYPES

# Nesting uses indentation. Better use 2 space indent
a_nested_map:
  key: value
  another_key: Another Value
  another_nested_map:
    hello: world

# Maps don't require string keys by the way
0.25: a float key

# Sequences look like this
a_sequence:
  - Item 1
  - Item 2
  - 0.5
  - Item 4
  - key: value
    another_key: another_value
  - - This is a sequence
    - inside another sequence
  - - - Sequence-ception

# Since YAML is a superset of JSON we can use JSON-style maps and sequences
# also quotes are optional
json-map: {key: value}
json-sequence: [1, 2, 3, 4, 5]

Documents

Up to this point we have worked only with a single YAML document. A single YAML file can contain more than one document. Each document is interpreted as if it were a separate YAML file, which means different documents can contain the same keys.

A YAML file with multiple documents would look like below, where each new document is indicated by ---.

---
# document 1
name: John Doe
age: 30

---
# document 2
pets:
  - name: Spot
    breed: Labrador Retriever
  - name: Whiskers
    breed: Siamese

---
# document 3
address:
    street: 123 Main Street
    city: Anytown
    state: CA
    zipcode: 12345

The three documents are separated by three dashes (---). This tells the parser that each document is a separate unit of data.
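
The specification also defines an optional document end marker, three dots (...), which signals that a document is complete without waiting for the next ---. A sketch with made-up data, which also shows the same keys appearing in two documents:

```yaml
---
# document 1
name: John Doe
age: 30
...
---
# document 2 reuses the keys of document 1 without any conflict
name: Jane Doe
age: 28
...
```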

Schemas and Tags

In the YAML specification, a schema is a set of tags (types) together with the rules a parser uses to decide which type an untagged value has. The schema determines, for example, whether the plain scalar 30 is loaded as an integer or as a string.

The YAML 1.2 specification defines three schemas:

FailSafe Schema: It only understands maps, sequences and strings and it is guaranteed to work for any YAML file.
JSON Schema: It understands all types supported within JSON including boolean, null, int and float, as well as the ones in the FailSafe schema.
Core Schema: It is an extension of the JSON schema, making it more human-readable supporting the same types but in multiple forms. So, null, Null and NULL will be resolved to the same type null.
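
As a sketch of what the Core schema accepts, each line below lists several spellings that a YAML 1.2 Core schema parser resolves to one type. The key names are illustrative, and parsers based on YAML 1.1 may treat some of these forms differently (0o52 in particular).

```yaml
# Under the Core schema, each group below resolves to one type
nulls: [null, Null, NULL, ~]
booleans: [true, True, TRUE, false, False, FALSE]
integers: [42, 0o52, 0x2A]     # decimal, octal and hexadecimal forms of 42
floats: [1.5, 1.5e3, .inf, -.inf, .nan]
```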

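Applications can also define their own schemas on top of these. Validating the structure of a document (the allowed keys, types and values) is usually done with a separate tool such as JSON Schema, whose definitions can themselves be written in YAML. For example, the following JSON Schema definition describes a person: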

type: object
properties:
  name:
    type: string
  age:
    type: integer

This schema specifies that a person is an object with two properties: a name and an age. The name property must be a string, and the age property must be an integer.
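
To illustrate, here are two hypothetical documents checked against that description. How they are actually validated depends on the tool you use; this sketch only shows which one matches.

```yaml
---
# Conforms: name is a string and age is an integer
name: John Doe
age: 30
---
# Does not conform: age is a string, not an integer
name: Jane Doe
age: "thirty"
```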

This leads us to the next question. What if I wanted to explicitly parse a value in a specific way?

This is where tags come into the picture. A YAML tag is a way to specify the type of the data being represented. The built-in types have a short form made of two exclamation marks (!!) followed by the type name, so !!int is shorthand for tag:yaml.org,2002:int. Even though we didn't explicitly mention the tags of any of the YAML snippets we've seen so far, they are inferred automatically by the YAML parser. For instance, mappings have the tag tag:yaml.org,2002:map. Below is a snippet that specifies some tags explicitly.

name: John Doe
age: !!int 30
pets:
  - name: Spot
    breed: Labrador Retriever
    species: !animal dog
  - name: Whiskers
    breed: Siamese
    species: !animal cat

This file uses two tags:

!!int: this tag specifies that the value of the age key is an integer.
!animal: this tag marks the value of the species key as an application-specific animal type. It is a local tag (a single !), not one defined by the YAML specification, so the application reading the file has to tell its parser how to handle it. Parsers that do not know the tag may reject the document.
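
The built-in tags are most useful for overriding the type a parser would otherwise infer. A minimal sketch, with illustrative keys; the results in the comments are what most parsers produce, not a guarantee for every implementation:

```yaml
zip-code: !!str 17288      # kept as the string "17288" instead of an integer
enabled: !!str true        # the string "true", not a boolean
ratio: !!float 3           # the float 3.0 instead of the integer 3
```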

Anchors and Aliases

In YAML files, anchors (&) and aliases (*) are used to avoid duplication. When writing large configurations in YAML, it is common for specific configurations to be repeated. In the following example, the service configuration is repeated for all three services.

service-configuration:
  service1:
    tasks:
    - name: Install Apache
      apt:
        name: apache2
        state: present

    - name: Start Apache
      service:
        name: apache2
        state: started
        enabled: true

  service2:
    tasks:
    - name: Install Apache
      apt:
        name: apache2
        state: present

    - name: Start Apache
      service:
        name: apache2
        state: started
        enabled: true

  service3:
    tasks:
    - name: Install Apache
      apt:
        name: apache2
        state: present

    - name: Start Apache
      service:
        name: apache2
        state: started
        enabled: true

As we add more and more settings to large configuration files, this will quickly become tedious. Also, if you want to change the configuration, you will have to find every copy and change it.

Anchors and aliases allow us to rewrite the same snippet without repeating any configuration. Anchors (&) are used to name a chunk of configuration, and aliases (*) are used to refer to that chunk in a different part of the configuration.

service-configuration:
  service1:
    tasks: &task-configuration
    - name: Install Apache
      apt:
        name: apache2
        state: present

    - name: Start Apache
      service:
        name: apache2
        state: started
        enabled: true

  service2:
    tasks: *task-configuration

  service3:
    tasks: *task-configuration
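
Anchors are not limited to sequences. As a sketch, the hypothetical configuration below anchors a single scalar and a whole mapping and reuses both. An anchor has to be defined before the first alias that refers to it.

```yaml
defaults:
  timezone: &tz Europe/Athens    # anchor a single scalar value
  retry-policy: &retry           # anchor a whole mapping
    attempts: 3
    delay-seconds: 10

service1:
  timezone: *tz                  # same as writing Europe/Athens
  retry-policy: *retry           # same as repeating the mapping above

service2:
  timezone: *tz
  retry-policy: *retry
```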

Congratulations on finishing the article! You are now well on your way to becoming a YAML master. YAML is a popular data-serialization language that is used practically anywhere configuration must be written by hand. Its popularity can be seen in Kubernetes, Ansible, docker-compose, and GitHub Actions.

Hope you enjoyed reading!

Top comments (6)

Teidra Scollard • Sep 10 '23

Great article! I am a newbie and this has given me a clear understanding of YAML and it concepts. 👍🏾

Kostas Kalafatis • Sep 10 '23

Thanks for your comment! I'm glad you liked the post. I'm planning to make a complete series on GitHub actions, and this is the first post. If you're interested, stick around!