A problem I've noticed a lot of aspiring data engineers running into recently is trying to run Airflow on Windows. This is harder than it sounds.
For many (most?) Python codebases, running on Windows is reasonable enough. For data, Anaconda even makes it easy - create an environment, install your library and go. Unfortunately, Airbnb handed us a pathologically non-portable codebase. I was flabbergasted to find that casually trying to run Airflow on Windows resulted in a bad shim script, a really chintzy pathing bug, a symlinking issue* and an attempt to use the Unix-only passwords database.
So running Airflow in Windows natively is dead in the water, unless you want to spend a bunch of months rewriting a bunch of the logic and arguing with the maintainers**. Luckily, there are two fairly sensible alternate approaches to consider which will let you run Airflow on a Windows machine: WSL and Docker.
WSL
WSL stands for the "Windows Subsystem for Linux", and it's actually really cool. Basically, the steps look something like this:
- Install the WSL by running some cryptic PowerShell commands (sketched below)
- Install Ubuntu from the Microsoft Store
- Type "Ubuntu" into the search bar, mash enter, and be dumped into a containerized Linux environment
I have WSL 2 installed, which is faster and better in many ways, but which (until recently? unclear) needs an Insider build of Windows.
Given that this is a fully operational Ubuntu environment, any tutorial that you follow for Ubuntu should also work in this environment.
Docker
The alternative, and the one I'm going to demo in this post, is to use Docker.
Docker is a tool for managing Linux containers, which are a little like virtual machines without the virtualization: they act like self-contained machines but are much more lightweight than a full VM. Surprisingly, it works on Windows - casually, even.
Brief sidebar: Docker isn't a silver bullet, and honestly it's kind of a pain in the butt. I personally find it tough to debug and its aggressive caching makes both cache busting and resource clearing difficult. Even so, the alternatives - such as Vagrant - are generally worse. Docker is also a pseudo-standard and Kubernetes - the heinously confusing thing your DevOps team makes you deploy to - works with Docker images, so it's overall a useful tool to reach for especially for problems like this one.
Setting up Docker Compose
Docker containers can be run in two ways: either in a bespoke capacity via the command line, or using a tool called Docker Compose, which takes a YAML file specifying which containers to run and how, and then does what's needed. For a single container the command line is often the thing you want - and we use it later on - but for a collection of services that need to talk to each other, Docker Compose is what we need.
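To make the distinction concrete, here's a rough sketch of the two modes, assuming Docker Desktop is already installed - a one-off container from the command line versus a whole stack from a compose file:

# Bespoke: run a single throwaway container straight from the CLI
docker run --rm -it ubuntu bash

# Compose: bring up (and later tear down) every service in docker-compose.yml
docker-compose up -d
docker-compose down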
So to get started, create a directory somewhere - mine's in ~\software\jfhbrook\airflow-docker-windows but yours can be anywhere - and create a docker-compose.yml file that looks like this:
version: '3.8'
services:
  metadb:
    image: postgres
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    networks:
      - airflow
    restart: unless-stopped
    volumes:
      - ./data:/var/lib/postgresql/data
  scheduler:
    image: apache/airflow
    command: scheduler
    depends_on:
      - metadb
    networks:
      - airflow
    restart: unless-stopped
    volumes:
      - ./airflow:/opt/airflow
  webserver:
    image: apache/airflow
    command: webserver
    depends_on:
      - metadb
    networks:
      - airflow
    ports:
      - 8080:8080
    restart: unless-stopped
    volumes:
      - ./airflow:/opt/airflow
networks:
  airflow:
There's a lot going on here. I'll try to go over the highlights, but I recommend referring to the file format reference docs.
First of all, we create three services: a metadb, a scheduler and a webserver. Architecturally, Airflow stores its state in a database (the metadb), the scheduler process connects to that database to figure out what to run when, and the webserver process puts a web UI in front of the whole thing. Individual jobs can connect to other databases, such as RedShift, to do actual ETL.
Docker containers are created based on Docker images, which hold the starting state for a container. We use two images here: apache/airflow, the official Airflow image, and postgres, the official PostgreSQL image.
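If you like, you can pull both images ahead of time so the first docker-compose up isn't stuck downloading - this is purely optional, since Docker will pull them on demand anyway:

# Optional: pre-fetch the two images the compose file refers to
docker pull apache/airflow
docker pull postgres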
Airflow also reads configuration, DAG files and so on out of a directory specified by an environment variable called AIRFLOW_HOME. The default if installed on your MacBook is ~/airflow, but in the Docker image it's set to /opt/airflow.
We use Docker's volumes functionality to mount the directory ./airflow under /opt/airflow. We'll revisit the contents of this directory before trying to start the cluster.
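If you want to confirm that the image really does set AIRFLOW_HOME to /opt/airflow, one quick way - a sketch that bypasses the image's normal entrypoint - is to dump its environment:

# Print the image's environment variables and look for AIRFLOW_HOME
docker run --rm --entrypoint env apache/airflow | Select-String AIRFLOW_HOME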
The metadb implementation is pluggable and supports most SQL databases via SQLAlchemy. Airflow uses SQLite by default, but in practice most people either use MySQL or PostgreSQL. I'm partial to the latter, so I chose to set it up here.
On the PostgreSQL side: you need to configure it to have a user and database that Airflow can connect to. The Docker image supports this via environment variables. There are many variables that are supported, but the ones I used are POSTGRES_USER, POSTGRES_PASSWORD and POSTGRES_DB. By setting all of these to airflow, I ensured that there was a superuser named airflow, with a password of airflow and a default database of airflow.
Note that you'll definitely want to think about this harder before you go to production. Database security is out of scope of this post, but you'll probably want to create a regular user for Airflow, set up secrets management with your deploy system, and possibly change the authentication backend. Your DevOps team, if you have one, can probably help you here.
PostgreSQL stores all of its data in a volume as well. The location in the container is /var/lib/postgresql/data, and I put it in ./data on my machine.
Docker has containers connect over virtual networks. Practically speaking, this means that you have to make sure that any containers that need to talk to each other are all connected to the same network (named "airflow" in this example), and that any containers that you need to talk to from outside have their ports explicitly exposed. You'll definitely want to expose port 8080 of the webserver to your host so that you can visit the UI in your browser. You may want to expose PostgreSQL as well, though I haven't done that here.
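One wrinkle worth knowing: Docker Compose prefixes the network name with the project name, which defaults to the name of the directory containing docker-compose.yml. Once the stack is up (after docker-compose up, below), you can see the network it actually created - on my setup that works out to something like airflow-docker-windows_airflow:

# List Docker networks; the compose-managed one is named <project>_airflow
docker network ls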
Finally, by default Docker Compose won't bother to restart a container if it crashes. This may be desired behavior, but in my case I wanted them to restart unless I told them to stop, and so set it to unless-stopped.
Setting Up Your Filesystem
As mentioned, a number of directories need to exist and be populated in order for Airflow to do something useful.
First, let's create the data directory, so that PostgreSQL has somewhere to put its data:
mkdir ./data
Next, let's create the airflow directory, which will contain the files inside Airflow's AIRFLOW_HOME:
mkdir ./airflow
When Airflow starts it looks for a file called airflow.cfg inside of the AIRFLOW_HOME directory, which is ini-formatted and which is used to configure Airflow. This file supports a number of options, but the only one we need for now is core.sql_alchemy_conn. This field contains a SQLAlchemy connection string for connecting to PostgreSQL.
Crack open ./airflow/airflow.cfg in your favorite text editor and make it look like this:
[core]
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@metadb:5432/airflow
Some highlights:
- The protocol is "postgresql+psycopg2", which tells SQLAlchemy to use the psycopg2 library when making the connection
- The username is airflow, the password is airflow, the port is 5432 and the database is airflow.
- The hostname is metadb. This is unintuitive and tripped me up - what's important here is that when Docker Compose sets up all of the networking stuff, it sets the hostnames for the containers to be the same as the service names as typed into the docker-compose.yml file. This service was called "metadb", so the hostname is likewise "metadb".
Initializing the Database
Once you have those pieces together, you can let 'er rip:
docker-compose up
However, you'll notice that the Airflow services start crash-looping immediately, complaining that various tables don't exist. (If it complains that the db isn't up, shrug, ctrl-c and try again. Computers amirite?)
This is because we need to initialize the metadb to have all of the tables that Airflow expects. Airflow ships with a CLI command that will do this - unfortunately, our compose file doesn't handle it.
Keep the Airflow containers crash-looping in the background; we can use the Docker CLI to connect to the PostgreSQL instance running in our compose setup and ninja in a fix.
Create a file called ./Invoke-Airflow.ps1 with the following contents:
$Network = "{0}_airflow" -f @(Split-Path $PSScriptRoot -Leaf)
docker run --rm --network $Network --volume "${PSScriptRoot}\airflow:/opt/airflow" apache/airflow @Args
The --rm flag removes the container after it's done running so it doesn't clutter things up. The --network flag tells Docker to connect to the virtual network you created in your docker-compose.yml file - since Compose prefixes the network name with the project name (the directory name by default), the script builds that name from $PSScriptRoot. The --volume flag tells Docker how to mount your AIRFLOW_HOME. Finally, @Args uses a feature of PowerShell called splatting to pass arguments to your script through to Airflow.
Once that's saved, we can run initdb against our Airflow install:
.\Invoke-Airflow.ps1 initdb
You should notice that Airflow is suddenly a lot happier. You should also be able to connect to Airflow by visiting localhost:8080 in your browser.
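If the UI doesn't come up, a couple of quick checks I find handy - these are standard docker-compose subcommands, nothing specific to this setup:

# Show the state of each service defined in the compose file
docker-compose ps

# Tail recent logs from the scheduler and webserver
docker-compose logs --tail=50 scheduler webserver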
For bonus points, we can use the postgres container to connect to the database with the psql CLI, using a very similar trick. Put this in Invoke-Psql.ps1:
$Network = "{0}_airflow" -f @(Split-Path $PSScriptRoot -Leaf)
docker run -it --rm --network $Network postgres psql -h metadb -U airflow --dbname airflow @Args
and then run .\Invoke-Psql.ps1 in the terminal.
Now you should be able to run \dt at the psql prompt and see all of the tables that airflow initdb created:
psql (12.3 (Debian 12.3-1.pgdg100+1))
Type "help" for help.
airflow=# \dt
List of relations
Schema | Name | Type | Owner
--------+-------------------------------+-------+---------
public | alembic_version | table | airflow
public | chart | table | airflow
public | connection | table | airflow
public | dag | table | airflow
public | dag_code | table | airflow
public | dag_pickle | table | airflow
public | dag_run | table | airflow
public | dag_tag | table | airflow
public | import_error | table | airflow
public | job | table | airflow
public | known_event | table | airflow
public | known_event_type | table | airflow
public | kube_resource_version | table | airflow
public | kube_worker_uuid | table | airflow
public | log | table | airflow
public | rendered_task_instance_fields | table | airflow
public | serialized_dag | table | airflow
public | sla_miss | table | airflow
public | slot_pool | table | airflow
public | task_fail | table | airflow
public | task_instance | table | airflow
public | task_reschedule | table | airflow
public | users | table | airflow
public | variable | table | airflow
public | xcom | table | airflow
(25 rows)
Conclusions
Now we have a working Airflow install that we can mess with. You'll notice that I didn't really go into how to write a DAG - there are other tutorials for that which should now be follow-able. Whenever they say to run the airflow CLI tool, run Invoke-Airflow.ps1 instead.
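For example, where a tutorial says to list or trigger DAGs, you'd run something like the following - the exact subcommand names depend on which Airflow version the image hands you (these are the Airflow 1.10-era spellings):

# List the DAGs Airflow knows about
.\Invoke-Airflow.ps1 list_dags

# Manually trigger one of the bundled example DAGs
.\Invoke-Airflow.ps1 trigger_dag example_bash_operator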
Using Docker, Docker Compose and a few wrapper PowerShell scripts, we were able to get Airflow running on Windows, a platform that's otherwise unsupported. In addition, we were able to build tooling to run multiple services in a nice, self-contained way, including a PostgreSQL database. Finally, by using a little PowerShell, we were able to make using these tools easy.
Cheers!
* Symbolic links in Windows are a very long story. Windows traditionally has had no support for them at all - however, recent versions of NTFS technically allow symlinks but require Administrator privileges to create them, and none of the tooling works with them.
** I'm not saying that the Airflow maintainers would be hostile towards Windows support - I don't know them, for one, and I have to assume they would be stoked. However, I also have to assume that they would have opinions. Big changes require a lot of discussion.
Addendum: Running in Production
I had someone ask me today about using this process to run Airflow in production. It should be noted that Docker doesn't work on all Windows installs. In particular, this reportedly won't work with server instances on Azure.
That said, if you're trying to run Airflow in production, you should probably deploy to Linux - or, if using Docker, to a managed Kubernetes product such as AKS on Azure or GKE on Google Cloud. Luckily, the only Windows-specific aspects of the procedure laid out here are the PowerShell snippets, and even PowerShell can run on Linux/MacOS if you install it.
I think Airflow now comes with an authentication requirement too...
I don't have time to run through this tutorial to update the directions, but if someone tells me what changed and what they did I'm happy to post an update (with a /ht!)