Say what?
Introduction to YAML
YAML stands for "YAML Ain’t Markup Language", which is known as a recursive acronym. YAML is often used for writing configuration files. It’s human readable, easy to understand, and supported by most programming languages. Although YAML is commonly used in many disciplines, it has been criticized for the amount of significant whitespace in .yml files, for being difficult to edit, and for the complexity of the standard. Despite the criticism, a well-maintained YAML environment file helps others reproduce the results of a project and keeps the virtual environment's packages from clashing with system packages. (If you're looking for another way to share environments, alternatives to YAML include StrictYAML, a type-safe YAML parser, and NestedText.)
One of the first steps in joining an existing data science project is setting up your virtual environment. This keeps the dependencies and packages used for this project from interfering with, or overwriting, those used by your other projects. There will often be a file with the .yml extension in the project files so you can quickly get working on the existing project. Below, I’ll quickly run through the steps I take to create a virtual environment on my M2 MacBook with Anaconda already installed.
Steps
So where is that YAML file on GitHub and what do I do with it?
First, projects will typically have one .yml file, but sometimes you’ll see special instructions in the project’s README:
Here’s what the actual YAML file will look like (they're usually at the root level of the directory, but can sometimes be further down in the directory):
This is what the file will look like on GitHub:
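For reference, conda environment files generally share the same three-part shape: a name, a list of channels, and a list of dependencies. The sketch below writes a small example to disk with a shell heredoc. The file name `example-environment.yml`, the environment name `project-env`, and the version pin are placeholders, not taken from any particular project.

```shell
# Write a small, hypothetical conda environment file.
# All names and version pins below are placeholders.
cat > example-environment.yml <<'EOF'
name: project-env
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.10
  - numpy
  - pandas
  - matplotlib
EOF

# Show the file that was just written.
cat example-environment.yml
```

Indentation matters here: the list items under `channels` and `dependencies` are indented with spaces, never tabs.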
To save the .yml file, simply click the Raw button here:
Then, in the newly opened tab, right-click and choose Save As:
(Make sure to save it somewhere you can easily find, as you’ll need to navigate to it!)
Open up a new terminal session and navigate to the directory where you saved the .yml file.
To create this new environment, I’ll enter:
conda env create -f geoenvironment.yml
After the virtual environment is done installing, you’ll need to activate it to use it. To do so, you’ll need to know its name, which should be displayed along with the command to activate it. If not, check a list of your environments by entering:
conda info --envs
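If the name scrolled past, it can also be read from the file itself, since conda takes the environment name from the file's top-level `name:` field unless told otherwise. Here is a minimal sketch, assuming the field sits on a single line at the start of a line. `env_name_from_file` is a hypothetical helper, not a conda command.

```shell
# Hypothetical helper: print the value of the top-level "name:" field
# from the conda environment file passed as the first argument.
env_name_from_file() {
    sed -n 's/^name:[[:space:]]*//p' "$1" | head -n 1
}

# Usage, guarded so nothing happens if the file is not in this directory.
if [ -f geoenvironment.yml ]; then
    env_name_from_file geoenvironment.yml
fi
```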
Then activate the new environment by entering the following (replace 'project-env' with the name of your virtual environment):
conda activate project-env
Now you’re ready to start chugging on that existing project.
When you’re done working in that virtual environment, don’t forget to deactivate and switch to the next environment you want!
conda deactivate
When I start a project from scratch, I prefer to begin with a very basic virtual environment and add the packages I need as I go along. My basic framework usually consists of:
Python
NumPy
Pandas
Matplotlib
& sometimes Seaborn
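That starter set maps onto a single `conda create` call. The sketch below only assembles and prints the command so it can be checked before running. The environment name `starter-env` is a placeholder, and the package names are the lowercase names conda uses for the libraries listed above.

```shell
# Hypothetical name for the new environment; replace it with your own.
ENV_NAME="starter-env"

# The starter packages listed above, as conda package names.
PACKAGES="python numpy pandas matplotlib seaborn"

# Assemble and print the command so it can be reviewed before running it.
CREATE_CMD="conda create --name $ENV_NAME $PACKAGES"
echo "$CREATE_CMD"

# To actually create the environment, run the printed command, for example:
#   conda create --name starter-env python numpy pandas matplotlib seaborn
```

Leaving versions unpinned lets conda pick compatible releases; pins can be added later once the project settles.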
Finally, once you've created your environment and you're ready to unleash it on the world, you can run a simple command to export it to a .yml file. Run it while the environment is active, since conda exports the active environment by default. Once you have your file, you can upload it or share it with whomever you need. Here is the command to export (feel free to replace "environment" with whatever file name you prefer):
conda env export > environment.yml
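One thing worth checking before sharing: an exported file usually ends with a `prefix:` line holding the absolute path of the environment on your own machine, which is of no use to anyone else. Here is a minimal sketch for dropping it, assuming the line starts at the beginning of a line. `strip_prefix` is a hypothetical helper, and the output file name `environment-shared.yml` is a placeholder.

```shell
# Hypothetical helper: print an environment file without its
# machine-specific "prefix:" line.
strip_prefix() {
    grep -v '^prefix:' "$1"
}

# Usage, guarded so nothing happens if no export exists yet.
if [ -f environment.yml ]; then
    strip_prefix environment.yml > environment-shared.yml
fi
```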
The whole process of creating and activating a new virtual environment is pretty simple when it works… However, if you run into errors, such as conda not being able to find the right packages, it can get a little hairy. Luckily, there are great resources out there that are just a quick Google search away. The most useful resources I found for these errors were on Stack Overflow and Apple Developer.
If you want to create a virtual environment from a .yml file, here’s a link to one of my projects (Tanzanian Water Wells: Predicting the Functionality of Water Wells in Tanzania) where you can try it out!
In Summary:
YAML isn’t scary (also ain’t markup language).
The .yml file is an important part of any data science workflow.
The .yml file is used to ensure that packages and versions are the same for everyone working on a project.
The .yml file helps with reproducibility.
Including a .yml file in a project makes collaboration easier.
YAML is altogether pretty simple.
Resources
The official YAML Web Site
If you’re looking for further resources on running TensorFlow and Keras on a newer MacBook, I recommend checking out this YouTube video: How to Install Keras GPU for Mac M1/M2 with Conda
If you’re looking for a resource on how to install Anaconda, I highly recommend that you go straight to the source, anaconda.org.
The M2 MacBook gave me some challenges when trying to work with TensorFlow and Keras due to some fancy chip architecture which you can read about here: TensorFlow with GPU support on Apple Silicon Mac with Homebrew and without Conda / Miniforge
If you want to take a dive into the YAML world, here's an in-depth tutorial: YAML: Everything You Need to Get Started in Minutes
Further, Very Serious Notes
Why did the YAML cross the road?
To get away from the package that broke the YAML's backend.