description: SQL and databases are among the most needed data science skills, it is #3 right afrer Python and R according to this empirical study. However, the need for a database isn’t obvious for the beginner programmer at first.At some point the aspiring data scientist will grow out the world of csvs and plain text files. Using databases becomes handy, when someone starts building Rest APIs or one has to connect to a remote SQL server full of gigabytes of valuable data. Here are our tips to get started with SQL and how to use it the Pythonic way.
SQL and databases are among the most needed data science skills. According to a recent study, SQL is the third most demanded skill right afrer Python and R. Suprisingly a beginner programmer can happily live without a database for a long time. However, at some point the aspiring data scientist will grow out the world of csvs and plain text files. Using databases becomes handy, when someone starts building Rest APIs or one has to connect to a remote SQL server full of gigabytes of valuable data. Here are our tips to get started with SQL and how to use it in the Pythonic way.
Which SQL implementation should I use?
SQL is a standard, its latest release came out in 2016. There are many closed and open source vendors who built their own implementation of the standard. Each likes to extend it with its own flavor, but the differences are minor (at least for a beginner). We encourage you to use MariaDB, unless you have good reason to ignore it (e.g. at work, your company is using MySQL, or at school you are learning about databases using Postgres, etc.)
Should I install it?
Absolutely no, you shouldn’t install it on your computer! Use the official Docker image of your prefered SQL implementation. If you don’t use Docker, invest some time into learning its basics.This tutorial helps you learn how to install Docker and start a container on your machine (the first twelve lessons till “Docker – Containers and Shells” is enough at first). Don’t simply start your Docker image, attach a volume to it, since this is the way to preserve (i.e. save) your databases. your effort turns to be a bonus, as knowing some Docker is a very valuable data science skill!
We strongly recommend you to start the phpMyAdmin, the free administration tool for SQL, Docker image along with your SQL implementation. phpMyAdmin provides a simple and intuitive interface to manage your databases and execute various SQL statements.
This short tutorial helps you to set up MariaDB and phpMyAdmin and persisting your databases using a docker-compose.
The pythonic way
SQL is a kind of programming language (actually, it is a so-called non-procedural programming language) and it is very different from Python. The easiest way to start using SQL in your Python projects is using the pymysql package, which lets you easily connect to your database,. On the top of that, you can write SQL statements as simple strings, which are passed to a function that sends them to the database engine for execution.
Using string variables to store your SQL statements isn’t pythonic. Although you can use f-strings to substitute parts of your expressions to your Python variables, this method can become very tedious esp. when you are working with complex statements. SQLAlchemy is the de facto standard way to use SQL in Python programs. It comes in two flavors, namely Core and ORM (which stands for object relational mapping). ORM is very advanced, hence chances are high that you won’t need it as a data scientist. Core provides you with the ability to use SQL statements as methods, so you can even chain them together. Also, you can use strings as SQL statements, aka “textual SQL” too. Using SQLAlchemy Core makes your code more pythonic and readable, which means a more maintainable code. If you don’t want to switch from pymysql to SQLAlchemy later, you can start using Core’s textual SQL and later you can gradually transist to Core objects and their methods. This part of the official documentation of the toolkit is a pretty nice intro into using Core.
DataFrames and SQL tables – How to integrate all this into your workflow?
You can easily make a pandas DataFrame from an SQL table and vice versa. This short tutorial shows you how easy it is to achieve this.
Resources
Although there are plenty of tutorials on the net, and we linked some of them in this post, we strongly recommend the following two books.
- Learning SQL, 2nd edition by Alan Beaulieu: This title is a short, practice oriented intro into SQL. It is language and implementation agnostic and despite its age it is superb.
- Essential SQLAlchemy, 2nd edition by Myers and Copeland: SQLAlchemy has got an extensive and very usable documentation, but it lacks user-friendly tutorials. This book is the only comprehensive intro into SQLAlchemy, as per our best knowledge.
Image sources
- Header image https://cdn.pixabay.com/photo/2017/06/12/04/21/database-2394312_960_720.jpg
- phpMyAdmin logo https://upload.wikimedia.org/wikipedia/commons/thumb/4/4f/PhpMyAdmin_logo.svg/115px-PhpMyAdmin_logo.svg.png
- Docker logo https://upload.wikimedia.org/wikipedia/commons/c/c9/MariaDB_Logo.png
- SQLAlchemy logo https://quintagroup.com/cms/python/images/sqlalchemy-logo.png
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Top comments (0)