Recently I have begun using Jupyter notebooks with Python, but I have struggled with the constant need to download dependencies and with installs that do not complete correctly. Seeing this as a continuing trend, and wanting a development environment that is portable between computers, I turned to learning how Docker works.
Working on a 2.8 GHz Intel Core i7 processor, I began researching different methods of setting up a Docker environment on this computer and on any other I might switch to at a later date. I found two methods to set up Intel Python in Docker using Jupyter Notebooks. When setting up the Intel Distribution for Python, I used Jupyter Notebooks as the front end for code, equations, and visualizations. This is what I am currently using for classes, and I find that it works great when I need to share code between team members.
To set this up, as mentioned, I wanted to use Docker, which packages an application and its dependencies into a container that can be run wherever Docker is installed. This gives an easily transferable environment to code in. When using Docker to set up Jupyter notebooks for the Python distribution, it is possible to use the already prepared image or to use an image as a base when customizing your own. Below I look at both ways to set up a Docker image for Intel Python on Jupyter notebooks.
Docker Image
Intel publishes both Python 2 and Python 3 images on Docker Hub, each in a core or full configuration. The core configurations contain NumPy/SciPy with dependencies, while full contains everything that Intel distributes. For my purposes I used the full version of Intel Python 2.
To get started using a Docker image with Jupyter notebooks, I downloaded the image I wanted from Docker Hub and set up a volume to use with the image. The volume is an optional addition when using a Docker container, but it allows for persistent data. I used a volume in this instance because it was the place I stored all the notebooks I wanted to run. Data written only inside a container is lost once that container is removed, and even before then it can be difficult to get out when another process requires it. Therefore, I created a folder on the host machine to mount into the container as a volume. To set up this Docker container, I followed the steps below:
- Download the Docker image from Docker Hub.
- Set up a folder to act as a volume for Docker. ~/Documents/notebooks was set up on the computer and attached to /home/notebooks in the Jupyter notebooks container. This allows for files to be easily accessible and version controlled after closing down the notebook.
- Open a terminal and run the notebook.
# Pull image
docker pull intelpython/intelpython2_full
# Set up folder
mkdir -p ~/Documents/notebooks/
# Run the notebook
docker run -v ~/Documents/notebooks:/home/notebooks -p 8888:8888 intelpython/intelpython2_full jupyter notebook --ip='*' --port=8888 --allow-root --no-browser
This may work for many applications, but this is where I ran into a problem. The code I was running in the Jupyter notebook called seaborn, a Python visualization library based on matplotlib that is used to create more attractive statistical graphics. The full image of Intel Python from Docker Hub did not provide this library, so I customized the Docker image using a Dockerfile to add seaborn.
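A quick way to confirm that a library is missing before opening a notebook is to try the import inside the image. The snippet below is only a sketch: image_has_module and the DOCKER override variable are hypothetical names introduced here, and it assumes the image lets you run python directly as the container command.

```shell
#!/bin/sh
# Sketch: check whether a Python module can be imported inside a Docker image.
# DOCKER is a hypothetical override hook (defaults to the real docker CLI)
# so the function can be dry-run with a stand-in command.
DOCKER="${DOCKER:-docker}"

# image_has_module IMAGE MODULE
# Exit status is 0 if "import MODULE" succeeds inside IMAGE, non-zero otherwise.
# --rm removes the throwaway container once the check has finished.
image_has_module() {
    "$DOCKER" run --rm "$1" python -c "import $2" >/dev/null 2>&1
}

# Example call (not executed here):
#   image_has_module intelpython/intelpython2_full seaborn || echo "seaborn is missing"
```

The function only reports success or failure, so it can be used in an if statement or chained with || as in the commented example.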
Dockerfile for Customization
To create a customized Docker image based on Intel Python that can be run in Jupyter notebooks, I set up a Dockerfile based on the Intel Python Dockerfiles on Docker Hub. With this, continuumio/miniconda3 is used as the base image to work from. This is because Anaconda is a platform powered by Python that contains the most popular data science packages for Python and R. These packages can then be installed with the conda dependency and environment manager, which the Miniconda image provides. By using this image, all needed packages not included in Intel Python can then be installed with conda when creating the customized image.
# Set the base image using miniconda
FROM continuumio/miniconda3:4.3.27
# Add metadata
LABEL version="1.0" \
description="Intel Python 2 using Jupyter Notebooks" \
date_created="01march2018" \
date_modified="28march2018"
Next, the environment variable ACCEPT_INTEL_PYTHON_EULA is set to 'yes' with the ENV instruction. This accepts the End User License Agreement (EULA) for Intel Python, which needs to be accepted every time a new environment is created. After setting this variable, the RUN instruction can be used to execute shell commands in a new layer. Each RUN instruction creates a new layer, so the commands here are chained with && into a single RUN. Using this instruction, conda can be used to install Intel Python, seaborn, and any other data science libraries you may need or want. Then apt-get is used to update and then install g++. After configuring a custom image, it can now be built and run for use.
# Set environmental variable(s)
ENV ACCEPT_INTEL_PYTHON_EULA=yes
# Install packages with conda, then clean, update, and install with apt-get
RUN conda config --add channels intel \
 && conda install -y -q intelpython2_full=2018.0.1 python=2 \
 && conda install -y -q seaborn \
 && apt-get clean \
 && apt-get update -qqq \
 && apt-get install -y -q g++
Build an Image
After completing the Dockerfile, check that you are in the correct location on the command line before running commands. I have often found myself in the wrong directory after going to look at something else before coming back to build an image.
$ ls
Dockerfile
Then, to build the image, run the build command with a tag, -t, for the image. The tag gives the image an easy-to-use name; I called mine test_intel so I could pick it out of a list quickly. The build may take a few minutes.
docker build -t test_intel .
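Because it is easy to end up in the wrong directory, the location check and the build can be combined. This is a minimal sketch, not part of the original setup: build_image and the DOCKER override variable are hypothetical names, and the default tag test_intel is simply the one used in this post.

```shell
#!/bin/sh
# Sketch: refuse to build unless a Dockerfile is present in the build context.
# DOCKER is a hypothetical override hook (defaults to the real docker CLI).
DOCKER="${DOCKER:-docker}"

# build_image [TAG] [CONTEXT_DIR]
# TAG defaults to test_intel, CONTEXT_DIR to the current directory.
build_image() {
    tag="${1:-test_intel}"
    context="${2:-.}"
    if [ ! -f "$context/Dockerfile" ]; then
        echo "No Dockerfile found in $context" >&2
        return 1
    fi
    "$DOCKER" build -t "$tag" "$context"
}

# Example call (not executed here):
#   build_image test_intel .
```

If the Dockerfile is missing, the function returns a non-zero status without calling docker at all.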
Run an Image
After the image is built, you can check Docker's list of images on your local machine to see the new image. Running this command shows the repository name, tag, image ID, time created, and size of each image, like the shortened example shown below. This is a good check to make sure the image built before moving forward.
docker image ls
REPOSITORY   TAG      IMAGE ID       SIZE
test_intel   latest   ce5d8aa2966d   6.52GB
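The same check can be scripted instead of read by eye. The sketch below relies on docker image inspect returning a non-zero exit status when the image is not present locally; image_exists and the DOCKER override variable are hypothetical names used only for illustration.

```shell
#!/bin/sh
# Sketch: test whether an image is present in the local image list.
# DOCKER is a hypothetical override hook (defaults to the real docker CLI).
DOCKER="${DOCKER:-docker}"

# image_exists IMAGE
# Exit status is 0 if IMAGE can be inspected locally, non-zero otherwise.
image_exists() {
    "$DOCKER" image inspect "$1" >/dev/null 2>&1
}

# Example call (not executed here):
#   image_exists test_intel && echo "test_intel is ready to run"
```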
Once complete, it is time to run the image. Running the image works similarly to the first example of setting up the core or full Docker image without customizations. In the run command, replace the image name with the new image you created in the previous steps, test_intel.
docker run -v ~/Documents/notebooks:/home/notebooks -p 8888:8888 test_intel jupyter notebook --ip='*' --port=8888 --allow-root --no-browser
After running this command in the terminal, a URL should appear for you to copy and paste into the browser to connect to Jupyter notebook with the Intel Python distribution now installed and ready to go. Once connected, you can begin using your customized environment. To shut down the server and all kernels, use Control-C in the terminal.
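Since the run command is long, it can be wrapped in a small shell function so it does not have to be retyped. This is a sketch under a few assumptions: run_notebook and the DOCKER override variable are hypothetical names, the default folder and image are the ones used in this post, and --rm is an optional addition of mine that removes the stopped container after Control-C.

```shell
#!/bin/sh
# Sketch: wrap the long "docker run" command for the notebook server.
# DOCKER is a hypothetical override hook (defaults to the real docker CLI).
DOCKER="${DOCKER:-docker}"

# run_notebook [IMAGE] [HOST_NOTEBOOK_DIR]
# IMAGE defaults to test_intel, HOST_NOTEBOOK_DIR to ~/Documents/notebooks.
run_notebook() {
    image="${1:-test_intel}"
    notebooks="${2:-$HOME/Documents/notebooks}"
    # Make sure the host folder exists before mounting it as a volume.
    mkdir -p "$notebooks"
    # --rm (an addition, not in the original command) deletes the container
    # once the server stops; the notebooks stay in the host folder.
    "$DOCKER" run --rm \
        -v "$notebooks:/home/notebooks" \
        -p 8888:8888 \
        "$image" \
        jupyter notebook --ip='*' --port=8888 --allow-root --no-browser
}

# Example calls (not executed here):
#   run_notebook                                  # customized image
#   run_notebook intelpython/intelpython2_full    # stock image
```

Placing the function in a shell startup file, or in a script kept next to the Dockerfile, keeps the command under version control along with the image definition.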
References
Intel Optimized Packages for the Intel Distribution for Python
Docker
seaborn
miniconda
Cover image sourced from Docker Wallpapers
Top comments (15)
Great tutorial, good work!
My suggestion is to learn docker compose next to avoid having to type incredibly lengthy docker run commands. It also helps you keep everything under version control so you can easily share your creations with other people with minimal guesswork on their part.
Thanks, I'll have to check that out!
Great post! I'm a huge fan of using Docker for Data Science.
I gave a talk a few months ago on how to incorporate Docker into various Data Science Workflows. Hope you find it useful!
Thanks! I will look into it. Just started using it and loving it already.
Awesome post @rosejcday . Mapbox actually just launched a library for location data visualizations with Jupyter Notebooks, check it out and lmk what you think! github.com/mapbox/mapboxgl-jupyter
This post is a great intro to setting up Jupyter using Docker! 👍🏼
I fought with a similar setup myself, after deciding to stop misusing my MacBook for data science experiments, and deploy a Docker container on the Google Cloud instead. Besides the things you have listed in your post, I had to tackle bundling a dedicated SHA-hashed Jupyter password, because my instance is publicly accessible over the Internet. Another issue I had to deal with was bundling the image with a private key for accessing the git repository where I keep my experiments. Not all without issues, but I managed. Maybe, I should sit down and write a post about this. Perhaps, it will be helpful to you and others.
The Docker Hub link is incorrect in "Download the Docker image from Docker Hub."
Weird! It was the right link when I checked, try again and see if it works now for you.
It works now. Thanks for the article : )
I love docker and also how easy it is to use for Data Science. Thank you.
Why don't you use virtualenv?
What would the benefit of using virtualenv be inside a docker container? I have researched it but everyone seems to have mixed views on using it or not using it.
I mean, use virtualenv instead of docker
I like the portability of Jupyter too. I use it to 1) write stories out of data for management and to 2) write tutorials for Python. What are your use cases?
At the moment I use it mainly for school. It has been great for school projects that need to be shared between a team.