dev_neil_a

Introduction to Data Analysis with Python Part 4: Data Visualisation with Matplotlib

Introduction

In this final part of the multi-part series, I'll be showing you how to create some basic charts using Matplotlib.

The data for the charts will come from a number of Pandas dataframes that will be created from the data that was used in part three.

To recap what was covered previously:

In part one, I covered importing data from a CSV file, cleaning up & converting data and finally exporting it to an Excel file.

In part two, I covered performing mathematical operations against the data that is stored in a dataframe using both Pandas and NumPy.

In part three, I covered how to perform analytical operations against data in a Pandas dataframe to produce figures that could be used for reporting, such as totals.

As in the previous parts, there is a Jupyter notebook, along with all the other required files, located in a GitHub repo that is linked in the Resources section.

With that said, let's get started on the series finale. Spoilers, there is no cliffhanger!

Step 1. Importing Pandas and NumPy

First of all, the Pandas and NumPy libraries need to be imported. In addition, Matplotlib will also be imported as it will be required for creating the charts.

# --- %matplotlib inline will ensure that the plots (charts) and figures show up in the notebook.
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Patch

If you don't have matplotlib installed, you can install it using pip:

pip install matplotlib

Step 2. Import From Excel

Once the libraries have been imported, the next step is to get the data imported. This is the same data that was used in part three and there have been no changes to it.

sales_data = pd.read_excel(io = "data/order_data_with_totals.xlsx",
                           sheet_name = "order_data_with_totals",
                           dtype      = {"order_id": np.int64,
                                         "order_date": "datetime64[ns]",
                                         "customer_id": np.int64, 
                                         "customer_first_name": str,
                                         "customer_last_name": str,
                                         "customer_gender": str,
                                         "customer_city": str,
                                         "customer_country": str,
                                         "item_description": str,
                                         "item_qty": np.int64,
                                         "item_price": np.float64,
                                         "order_currency": str,
                                         "order_vat_rate": np.float64,
                                         "order_total_ex_vat_local_currency": np.float64,
                                         "order_total_vat_local_currency": np.float64,
                                         "order_total_inc_vat_local_currency": np.float64,
                                         "order_currency_conversion_rate": np.float64,
                                         "order_total_ex_vat_converted_gbp": np.float64,
                                         "order_total_vat_converted_gbp": np.float64,
                                         "order_total_inc_vat_converted_gbp": np.float64})

Step 3. Validating the Data

Now that the data has been imported from the Excel file into the sales_data dataframe, let's take a look at the data it contains.

Step 3.1. What the Data Looks Like

First, let's have a look at the first five rows of the sales_data dataframe.

sales_data.head(n = 5)

output from above

There are more columns in the sales_data dataframe, but showing all of them would make the output too wide to fit into an image.

Step 3.2. Check the Columns DataTypes

Next, let's have a look at the datatypes that have been assigned to each column in the sales_data dataframe.

sales_data.dtypes

output from above

As expected, all the datatypes match what was specified when the data was imported.

Step 3.3. Check for NaN (Null) Values

sales_data.isna().sum()

output from above

Just as before, there are no NaN values in the dataframe.
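The datatype and NaN checks above can also be written as assertions, so that a problem stops the notebook instead of relying on a visual check. This is only a minimal sketch: toy_sales is a made-up two-column dataframe that stands in for sales_data, and the column names are borrowed from it purely for illustration.

```python
import numpy as np
import pandas as pd

# --- Hypothetical two-column stand-in for sales_data (made-up values):
toy_sales = pd.DataFrame({"order_id": np.array([1, 2, 3], dtype = np.int64),
                          "item_price": np.array([9.99, 4.50, 20.00], dtype = np.float64)})

# --- Count the NaN values per column, then add the counts together:
total_missing = int(toy_sales.isna().sum().sum())

# --- Stop with an error if anything is missing or a datatype is not what was expected:
assert total_missing == 0
assert toy_sales["order_id"].dtype == np.int64
assert toy_sales["item_price"].dtype == np.float64
```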

Now, let's move on to creating some additional dataframes from the sales_data dataframe that can then be used to create some charts.

Step 4. Creating New Dataframes for Charting

In this section, two new dataframes will be created that will be used for creating the charts. The first dataframe will cover the total number of orders by the currencies that were used and the second will cover the total number of orders by the customers' gender.

Unlike in part three, the two dataframes will each be assigned to a variable so they can be referenced when it comes to creating the charts.

Step 4.1. Create Total Number of Orders by Currency Dataframe

# --- Create a variable for the dataframe:
orders_by_currency_df = sales_data.groupby(["order_currency"])\
                                  .size()\
                                  .to_frame("total_number_of_orders")\
                                  .sort_values("total_number_of_orders", 
                                               ascending = True)


# --- Show the contents of the dataframe:
orders_by_currency_df

output from above
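To see what the groupby chain is doing, here is a small sketch on made-up data. The toy_sales dataframe and its five rows are hypothetical and are not taken from the real order data. It also shows value_counts, which is an alternative I have swapped in for comparison rather than the method used in this article.

```python
import pandas as pd

# --- Hypothetical stand-in for sales_data with only the column that matters here:
toy_sales = pd.DataFrame({"order_currency": ["GBP", "EUR", "GBP", "GBP", "EUR"]})

# --- Same chain as above: count the rows per currency, name the column, sort ascending:
via_groupby = toy_sales.groupby(["order_currency"])\
                       .size()\
                       .to_frame("total_number_of_orders")\
                       .sort_values("total_number_of_orders",
                                    ascending = True)

# --- value_counts() gives the same counts in a single call, as a Series:
via_value_counts = toy_sales["order_currency"].value_counts(ascending = True)

print(via_groupby)
print(via_value_counts)
```

Both results hold a count of 2 for EUR and 3 for GBP. The groupby version returns a dataframe with a named column, which is convenient when the column is referenced by name later on.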

Step 4.2. Create Total Number of Orders by Gender Dataframe

orders_by_gender_df = sales_data.groupby(["customer_gender"])\
                                .size()\
                                .to_frame("no_of_orders")\
                                .sort_values("no_of_orders", 
                                             ascending = False)


# --- Show the contents of the dataframe:
orders_by_gender_df

output from above

Step 5. Creating Charts with Matplotlib

So what is Matplotlib? Matplotlib is a Python library for creating charts from data that can come from many different sources. In the examples in this article, the data sources will be the two dataframes that were created earlier from the sales_data dataframe.

A Matplotlib chart consists of a number of elements. The below diagram depicts what each element is.

output from above

  • Figure: Think of this as a canvas that the chart or charts are placed onto.
  • Figure Title: The title of the figure. This can be different to the title given to a chart (or axes). This is not shown on the above example.
  • Axes: An axes is the container for a chart (also called a plot). An axes sits on top of the figure and there can be more than one axes on a figure.
  • Axes Title: This is the title for the axes.
  • Y-Axis Label: The label that describes what the y-axis represents.
  • X-Axis Label: Does the same as the y-axis label, only it's for the x-axis.
  • Tick: A marker on an axis, along with its label, that shows what each position represents (for example, which currency a bar represents).
  • Legend: A list of what each data point on the plot/chart is.

As part of each chart, I've added notes for each section of the code to describe what it does or what its purpose is.

First, let's begin by looking at making a bar chart from the orders_by_currency_df dataframe.

Step 5.1. Orders by Currency as a Bar Chart

There are two ways that you can create charts (plots) with Matplotlib. The first is the plot method that Pandas provides on a dataframe. It uses Matplotlib behind the scenes, creates everything for you and offers some level of customisation. For example, let's create a quick bar chart using the orders_by_currency_df dataframe:

orders_by_currency_df.plot(kind = "bar")

output from above

As you can see, it is a basic bar chart that shows the data with very little formatting. But let's say we want to create a bar chart using a method that is more object-oriented and offers the maximum amount of customisation available. This is what will be used in the examples going forward, starting with the below bar chart.

# --- Create a list of colors to use for each item in the dataframe.
colors_to_use = ["blue", "red"]


# --- Setup the figure and the axes:
fig, ax = plt.subplots(figsize = (10, 8))


# --- This sets the figure to white. There seems to be a bug in VS Code that can cause
# --- the figure to go dark when using dark mode passed through from the operating system.
fig.set_facecolor("white")


# --- Customise the bar chart axes:
ax.set_title(label    = "Total Orders by Currency", 
             fontdict = {"fontsize": 20,
                         "color": "black",
                         "weight": "bold"})


# --- Set the x axis label:
ax.set_xlabel("Currency", fontsize = 16)
plt.xticks(fontsize = 16)


# --- Set the y axis label:
ax.set_ylabel("Number of Orders", fontsize = 16)
plt.yticks(fontsize = 16)


# --- Create a bar plot:
bar_chart = ax.bar(x          = orders_by_currency_df.index.values, 
                   height     = orders_by_currency_df["total_number_of_orders"],
                   color      = colors_to_use,
                   tick_label = orders_by_currency_df.index)


# --- Set the label for each bar to appear inside each bar with the value of each currency:
ax.bar_label(container  = bar_chart, 
             label_type = "center",
             labels     = orders_by_currency_df["total_number_of_orders"],
             color      = "white",
             weight     = "bold",
             fontsize   = 16)


# --- Create a dictionary that maps the currency to the color used.
# --- These will be used in the legend.
currency_cmap = dict(zip(orders_by_currency_df.index.values, 
                         colors_to_use))

patches = [Patch(label = currency, 
                 color = currency_color) for currency, currency_color in currency_cmap.items()]


# --- Add a legend:
ax.legend(handles    = patches,
          fontsize   = 16,
          labelcolor = "black",
          title      = "Currency",
          title_fontproperties = {"size": 16,
                                  "weight": "bold"});

output from above

You may have noticed that there is a semi-colon at the end of the last line of the code, which is unusual in Python.

The reason for this is that a Jupyter notebook displays the value returned by the last expression in a cell, which here is the legend object returned by ax.legend(). That text appears above the chart. Adding the semi-colon suppresses it so you will only see the chart.

Note: If you look at the first bar chart at the beginning of this step, you will see the object name.
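The object name in that first chart is the value that the plot call returns, which is the Matplotlib axes it drew on. A small sketch, assuming a made-up dataframe (toy_df is hypothetical and only mimics the shape of orders_by_currency_df), shows that assigning the return value to a variable also keeps it out of the output and lets you carry on customising the quick chart:

```python
import pandas as pd
import matplotlib.pyplot as plt

# --- Made-up counts standing in for orders_by_currency_df:
toy_df = pd.DataFrame({"total_number_of_orders": [2, 3]},
                      index = ["EUR", "GBP"])

# --- plot() returns the Matplotlib axes it drew on. Assigning it to a variable
# --- means the notebook has nothing left to display as text:
returned_ax = toy_df.plot(kind = "bar")

# --- The returned axes accepts the same calls used in the longer example above:
returned_ax.set_title("Total Orders by Currency")
returned_ax.set_ylabel("Number of Orders")

# --- Show what kind of object came back:
print(type(returned_ax))
```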

Step 5.2. Percentage of Orders by Currency as a Pie Chart

Now let's take the same data used for the bar chart and create a pie chart from it.

# --- Create a list of colors to use for each item in the dataframe.
colors_to_use = ["blue", "red"]


# --- Setup the figure and the axes:
fig, ax = plt.subplots(figsize = (14, 10))


# --- This sets the figure to white. There seems to be a bug in VS Code that can cause
# --- the figure to go dark when using dark mode passed through from the operating system.
fig.set_facecolor("white")


# --- Customise the pie chart plot:
ax.set_title(label    = "Total Orders by Currency (%)", 
             fontdict = {"fontsize": 20,
                         "color": "black",
                         "weight": "bold"})


# --- Create a pie chart plot.
# --- explode will take one of the pieces out of the pie slightly.
# --- autopct will format the percentages to two decimal places:
patches, texts, pcts = ax.pie(x         = orders_by_currency_df["total_number_of_orders"], 
                              labels    = orders_by_currency_df.index.values,
                              explode   = (0.2, 0),
                              autopct   = '%0.2f%%',
                              shadow    = False,
                              colors    = colors_to_use,
                              textprops = {"fontsize": 16,
                                           "weight": "bold"})


# --- Set the color of the percentage to white:
plt.setp(pcts, 
         color  = "white",
         weight = "bold")


# --- This will change the color of the text label for each slice to the color the slice used:
for index_pos, patch in enumerate(patches):
    texts[index_pos].set_color(patch.get_facecolor())


# --- Add a legend:
ax.legend(fontsize = 16,
          title    = "Currency",
          loc      = "upper left",
          title_fontproperties = {"size": 16,
                                  "weight": "bold"});

output from above
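The percentages shown on the slices are each value's share of the total, formatted with the autopct string. The sketch below works that out by hand and compares it with what the pie chart produces. The counts are made up (toy_counts is hypothetical), so the numbers will not match the real chart.

```python
import pandas as pd
import matplotlib.pyplot as plt

# --- Made-up counts standing in for orders_by_currency_df["total_number_of_orders"]:
toy_counts = pd.Series([2, 3], index = ["EUR", "GBP"])

# --- Each value's share of the total, as a percentage:
percentages = toy_counts / toy_counts.sum() * 100

# --- Apply the same format string that is passed to autopct:
slice_labels = ['%0.2f%%' % pct for pct in percentages]

# --- Draw the pie and read back the percentage text Matplotlib created:
fig, ax = plt.subplots()
patches, texts, pcts = ax.pie(x       = toy_counts,
                              labels  = toy_counts.index.values,
                              autopct = '%0.2f%%')
pie_labels = [pct.get_text() for pct in pcts]

print(slice_labels)
print(pie_labels)
```

With these made-up counts, both lists contain 40.00% and 60.00%.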

Step 5.3. Percentage of Orders by Gender as a Pie Chart

Lastly, let's create another pie chart, this time using the orders_by_gender_df dataframe.

# --- Create a list of colors to use for each item in the dataframe.
colors_to_use = ["red", "blue", "purple"]


# --- Setup the figure and the axes:
fig, ax = plt.subplots(figsize = (16, 12))


# --- This sets the figure to white. There seems to be a bug in VS Code that can cause
# --- the figure to go dark when using dark mode passed through from the operating system.
fig.set_facecolor("white")


# --- Customise the pie chart plot:
ax.set_title(label    = "Total Orders by Gender (%)", 
             fontdict = {"fontsize": 20,
                         "color": "black",
                         "weight": "bold"})


# --- Create a pie chart plot.
# --- explode will not take any of the pieces out of the pie.
# --- autopct will format the percentages to two decimal places:
patches, texts, pcts = ax.pie(x         = orders_by_gender_df["no_of_orders"],
                              labels    = orders_by_gender_df.index.values,
                              autopct   = '%0.2f%%',
                              explode   = (0.0, 0.0, 0.0),
                              shadow    = False,
                              colors    = colors_to_use,
                              textprops = {"fontsize": 16,
                                           "weight": "bold"})


# --- Set the color of the percentages to white:
plt.setp(pcts, 
         color = "white",
         weight = "bold")


# --- This will change the color of the text label for each slice to the color the slice used:
for index_pos, patch in enumerate(patches):
    texts[index_pos].set_color(patch.get_facecolor())


# --- Add a legend:
ax.legend(fontsize = 16,
          labelcolor = "black",
          title = "Gender",
          loc = "upper left",
          title_fontproperties = {"size": 16,
                                  "weight": "bold"});

output from above

Conclusion

In this final part in the series, I covered how to use an object-oriented way of using Matplotlib to allow you to create some basic charts.

There are many different options you can use to customise charts, be that color themes, visualisation styles (such as histograms, scatter plots and line graphs to name a few) and more.

I would recommend checking the Matplotlib documentation to see what other possibilities are available for you to use.
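As a starting point, here is a rough sketch of the three other styles mentioned above, using the same figure and axes approach as the rest of this article. The data is randomly generated and has nothing to do with the order data, and the colors and sizes are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

# --- Made-up data, seeded so that the sketch gives the same charts each time:
rng    = np.random.default_rng(seed = 42)
values = rng.normal(loc = 50, scale = 10, size = 200)

# --- One figure with three axes side by side:
fig, (ax_hist, ax_scatter, ax_line) = plt.subplots(nrows = 1, ncols = 3, figsize = (18, 5))

# --- Histogram: how often values fall into each of 20 ranges (bins):
ax_hist.hist(values, bins = 20, color = "blue")
ax_hist.set_title("Histogram")

# --- Scatter plot: one point per value, plotted against its position:
ax_scatter.scatter(x = np.arange(200), y = values, color = "red")
ax_scatter.set_title("Scatter Plot")

# --- Line graph: the same values sorted from smallest to largest:
ax_line.plot(np.sort(values), color = "purple")
ax_line.set_title("Line Graph")
```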

Thank you for reading and have a good day!

Resources

GitHub files for part 4
