Kedar Ghule


How to Build an Interactive Bubble Map in Python Using Plotly

In this tutorial, we will be creating a county-level geographic bubble map of the active COVID-19 cases in the United States. First of all, let us understand what a Bubble Map is!

What is a Bubble Map?

Bubble maps are a kind of geographic visualization that has its roots in bubble charts. In bubble charts, the bubbles are plotted on a Cartesian plane. In bubble maps, the bubbles are plotted over geographic regions. The size of each bubble is proportional to the value of a particular variable for that region. Bubble maps are useful because they make it easy to compare magnitudes across a geographic region.

Building a Bubble Map Using Plotly

Let us dive straight into the tutorial now. Throughout this tutorial, we will also do some basic exploratory data analysis and data cleaning.

1. Importing Libraries

The first step is to import the libraries we will need throughout this tutorial. We will be using the popular Python data analysis library pandas and our data visualization library, Plotly. Specifically, we need to import the graph_objects module from plotly.



import pandas as pd
import plotly.graph_objects as go



2. Loading Our Dataset

Next, we import our dataset and store it in a DataFrame. The dataset I am using is published by Johns Hopkins University in its CSSEGISandData/COVID-19 GitHub repository. When this code was written, the dataset for 6th March 2021 was the last one that included the active COVID-19 case count. It seems that Johns Hopkins removed the active and recovered cases data from the datasets after 6th March 2021.



df = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/03-06-2021.csv",
                dtype={"FIPS": str})
df.head()


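The dtype={"FIPS": str} argument in the call above deserves a quick illustration. The sketch below uses a small made-up CSV (the rows and values are hypothetical and are not taken from the Johns Hopkins file) to show what would likely happen without it: because one FIPS entry is missing, pandas parses the column as floating-point numbers, whereas reading it as a string keeps the codes exactly as written.

```python
import io

import pandas as pd

# Hypothetical sample rows; the real file has many more columns.
sample_csv = "FIPS,Admin2,Active\n01001,Autauga,120\n,Unassigned,5\n"

# Default parsing: the column is inferred as numeric.
default_df = pd.read_csv(io.StringIO(sample_csv))

# Same data, but FIPS is forced to be read as text.
string_df = pd.read_csv(io.StringIO(sample_csv), dtype={"FIPS": str})

print(default_df["FIPS"].tolist())  # numeric values, leading zero lost
print(string_df["FIPS"].tolist())   # codes kept as text
```

FIPS codes are identifiers rather than quantities, so keeping them as text is usually the safer choice even though this tutorial does not use the column later.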

This is what our DataFrame looks like:
[Image: load the dataset]

3. Exploratory Data Analysis and Data Cleaning

Now you can see that this DataFrame has data for other countries as well. Since we are focusing only on the United States data for this tutorial, let's filter to only the US data and update our DataFrame.



df = df[df.Country_Region == "US"]
df.head()



Now, our updated DataFrame looks like this:
[Image: updated dataframe]

Let us now explore the data further. First, let us find the length of the DataFrame. Since we are going to make a bubble map for the active COVID-19 cases in the US, let us check the maximum and minimum values in the Active column. The Active column contains the data for the active COVID-19 cases.



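print(len(df))          # number of rows in the US-only DataFrame
print(df.Active.max())  # largest active case count
print(df.Active.min())  # smallest active case count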



Depending on the dataset you are using, you will get different values for the above statements.

[Image: exploratory data analysis 1]

Wait, how can active cases be a negative number? Surely there must be something wrong. Let us see which row in the DataFrame has this data. Furthermore, let us also check what other rows have their active cases values less than 0.



df[df.Active == df.Active.min()]
df[df.Active < 0]



[Image: exploratory data analysis 2]

Ah! You can see that some unassigned rows have these values. This data needs to be cleaned from our DataFrame as it serves no purpose for us. So, we will keep only the rows whose value in the Active column is greater than 0, which filters out the negative values as well as any rows with zero active cases. We will also take another look at the length of the DataFrame and the minimum and maximum values in the Active column.



df = df[df.Active > 0]  # keep only rows with at least one active case
print(len(df))
print(df.Active.max(), df.Active.min())
df.head()


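To make the effect of the filter concrete, here is a minimal sketch on a made-up DataFrame (the county names and counts are hypothetical). It contrasts the Active > 0 condition used above with Active >= 0, which would keep rows that have zero active cases.

```python
import pandas as pd

# Hypothetical values, chosen to include a negative, a zero and positives.
toy = pd.DataFrame({"Admin2": ["A", "B", "C", "D"],
                    "Active": [-50, 0, 12, 3400]})

non_negative = toy[toy.Active >= 0]  # keeps the zero row
positive = toy[toy.Active > 0]       # drops negatives and zeros, as in the tutorial

print(len(non_negative), len(positive))
```

Dropping the zero rows fits the rest of the tutorial, since the smallest range in the legend starts at 1 and a county with no active cases would not produce a visible bubble anyway.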

[Image: exploratory data analysis 3]

Let us check for missing values in other columns before moving ahead, specifically the Admin2, Lat, and Long_ columns. The Admin2 column specifies the county name. The Lat and Long_ columns specify the latitude and longitude values for these counties. These columns will feature heavily while we work on the code for the bubble map.



df.isna().sum()



[Image: data cleaning]

So we get the number of missing values in each column of our DataFrame. Our Admin2 column has 5 missing values, while the Lat and Long_ columns have 36 missing values each. Let us remove the rows with missing values in the Admin2, Lat, or Long_ columns, since they would serve no purpose when plotting our bubble map. We will also verify that these rows have been removed.



# Reassign instead of using inplace=True, since df is itself a filtered slice
df = df.dropna(subset=['Lat', 'Long_', 'Admin2'])
df.isna().sum()



[Image: data cleaning 2]

Fantastic! Our three main columns (Admin2, Lat, and Long_) no longer have any missing values.

4. Sorting And Rearranging Data

Next, let us sort our DataFrame in descending order of active cases. Since the sorting rearranges the indexes of the DataFrame, we will also reset the indexes of our newly sorted DataFrame.



df = df.sort_values(by=["Active"], ascending=False)
df.reset_index(drop=True, inplace=True)
df.head()


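The sketch below, on a made-up three-row DataFrame, shows why the index is reset after sorting: sort_values moves the rows but each row keeps its original index label, so the labels end up out of order until reset_index assigns fresh ones.

```python
import pandas as pd

# Hypothetical counties and counts, for illustration only.
toy = pd.DataFrame({"Admin2": ["A", "B", "C"], "Active": [10, 300, 45]})

sorted_toy = toy.sort_values(by="Active", ascending=False)
labels_after_sort = sorted_toy.index.tolist()
print(labels_after_sort)    # original labels, now out of order

sorted_toy = sorted_toy.reset_index(drop=True)
labels_after_reset = sorted_toy.index.tolist()
print(labels_after_reset)   # fresh labels starting from 0
```

The fresh labels matter in the next step, where index values are used as row positions.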

[Image: sort and rearrange dataframe]

5. Setting Value Limit Intervals

We need to set some levels or limits to group the range of COVID-19 cases by specifying an upper bound and a lower bound of active COVID cases. For this, we create a list called stages. This stages list will be used for our bubble map's legend. 1-100 cases will be one range, 101-1000 cases will be another range, and so on.

After that, we will store the index ranges of the rows that fall in these ranges as a list of tuples called limits. This works because the DataFrame is sorted by active cases, so the rows of each range sit next to each other.



stages = ["400000+", "300001-400000", "200001-300000", "100001-200000", "50001-100000", "10001-50000",
         "1001-10000", "101-1000", "1-100"]

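# Create (start, end) tuples of row indexes for the above ranges; end is exclusive.
# This assumes df is sorted by Active in descending order with a freshly reset index,
# and that every range contains at least one row (otherwise .index[-1] raises IndexError).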
tuple1 = (0, df[df.Active > 400000].index[-1]+1)
tuple2 = (tuple1[1], df[(df.Active > 300000) & (df.Active <=400000)].index[-1]+1)
tuple3 = (tuple2[1], df[(df.Active > 200000) & (df.Active <=300000)].index[-1]+1)
tuple4 = (tuple3[1], df[(df.Active > 100000) & (df.Active <=200000)].index[-1]+1)
tuple5 = (tuple4[1], df[(df.Active > 50000) & (df.Active <=100000)].index[-1]+1)
tuple6 = (tuple5[1], df[(df.Active > 10000) & (df.Active <=50000)].index[-1]+1)
tuple7 = (tuple6[1], df[(df.Active > 1000) & (df.Active <=10000)].index[-1]+1)
tuple8 = (tuple7[1], df[(df.Active > 100) & (df.Active <=1000)].index[-1]+1)
tuple9 = (tuple8[1], df[df.Active <=100].index[-1]+1)

limits = [tuple1, tuple2, tuple3, tuple4, tuple5, tuple6, tuple7, tuple8, tuple9]
limits



[Image: groups and intervals]

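So, tuple1 holds the start and end positions of all rows whose active cases value is greater than 400,000. tuple2 holds the positions of all rows whose active cases value is greater than 300,000 but less than or equal to 400,000. And so on. The end position in each tuple is exclusive, which is why we add 1 to the last matching index.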
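The nine tuple assignments rely on every range containing at least one row. As an alternative, here is a minimal sketch of a helper that derives the same kind of (start, end) pairs by counting. The function name compute_limits and the sample values are hypothetical and not part of the original code. It assumes the values are already sorted in descending order, and it returns an empty range (start equal to end) instead of failing when no row falls into a range.

```python
def compute_limits(active_values, lower_bounds):
    """Return (start, end) positions for each range of descending-sorted values.

    lower_bounds lists the exclusive lower bound of each range, largest first.
    """
    limits = []
    start = 0
    for lower in lower_bounds:
        # In descending order, the values above `lower` occupy positions 0..end-1.
        end = sum(1 for value in active_values if value > lower)
        limits.append((start, end))
        start = end
    return limits


# Hypothetical active case counts, already sorted in descending order.
active = [450000, 120000, 60000, 900, 50, 3]
lower_bounds = [400000, 300000, 200000, 100000, 50000, 10000, 1000, 100, 0]

limits = compute_limits(active, lower_bounds)
print(limits)
```

With the tutorial's DataFrame, something like compute_limits(df["Active"].tolist(), lower_bounds) should give a list that can be used in place of the hand-written tuples, as long as the DataFrame has been sorted and re-indexed first.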

6. Time to Plot our Bubble Map!

Since bubble maps show a bubble size proportional to the variable's value, it is also essential to set the right colour for the bubble. Aesthetics make a lot of difference in data visualizations. We will set a list of colours. I chose shades of red from the following link - http://www.workwithcolor.com/red-color-hue-range-01.htm. Note that the number of colours should be equal to the number of tuples we have in the limits variable.



colors = ["#CC0000","#CE1620","#E34234","#CD5C5C","#FF0000", "#FF1C00", "#FF6961", "#F4C2C2", "#FFFAFA"]


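Because the colours, the range labels, and the limits are matched purely by position, a quick sanity check can catch a list that is one item short. The sketch below repeats the stages and colors lists from this tutorial and pairs them up; the legend dictionary is only an illustrative helper and is not used by the plotting code.

```python
stages = ["400000+", "300001-400000", "200001-300000", "100001-200000",
          "50001-100000", "10001-50000", "1001-10000", "101-1000", "1-100"]
colors = ["#CC0000", "#CE1620", "#E34234", "#CD5C5C", "#FF0000",
          "#FF1C00", "#FF6961", "#F4C2C2", "#FFFAFA"]

# Fail early if the two lists ever get out of step.
assert len(stages) == len(colors), "each range needs exactly one colour"

# Pair each range label with its colour, purely for inspection.
legend = dict(zip(stages, colors))
for label, colour in legend.items():
    print(label, colour)
```

The same length check could be applied to the limits list before plotting.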

Note that if you are using a Jupyter notebook, the code below should be in one cell. I have split it up in this blog post to make it easier to explain.



fig = go.Figure()
stage_counter = 0
for i in range(len(limits)):
    lim = limits[i]
    df_sub = df[lim[0]:lim[1]]
    fig.add_trace(go.Scattergeo(
        locationmode = 'USA-states',
        lon = df_sub['Long_'],
        lat = df_sub['Lat'],
        text = df_sub['Admin2'],
        marker = dict(
            size = df_sub['Active']*0.002,
            color = colors[i],
            line_color='rgb(40,40,40)',
            line_width=0.5,
            sizemode = 'area'
        ),
        name = '{}'.format(stages[stage_counter])))
    stage_counter = stage_counter+1



Okay, here starts the complex part.

First, we set our stage_counter (the variable that tracks which stage we are on) to 0.

Next comes the for loop, which runs 9 times, once for every tuple in the limits variable. During each iteration, we extract a part of our original DataFrame into df_sub. The new DataFrame df_sub contains the rows whose index falls in the range specified by that tuple. During our first iteration, df_sub will contain the rows with indexes 0, 1, 2 and 3.

In the same iteration, we plot the bubbles for those rows using the latitude and longitude values given for each county in the Lat and Long_ columns. We set the 'text' parameter to the county's name (the value in the Admin2 column) so that, once the visualization is ready, we can hover over a bubble to see the name of the county. Next, we make the size of the bubble proportional to the active COVID-19 cases by multiplying the value in the Active column by 0.002. You may use a different value; this one seemed apt for my visualization. Because sizemode is set to 'area', the value scales the bubble's area rather than its diameter. We also specify the colour of the bubble.

The 'name' parameter specifies the trace name, which appears as the legend item and on hover. For the first iteration, this value will be the first item in the stages list, i.e., "400000+". And finally, before we move to the next iteration, we increment the stage_counter by 1.
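The factor 0.002 was chosen by eye. As a hedged alternative, the sketch below computes a sizeref value using the scaling rule that, as far as I know, Plotly's bubble chart documentation suggests for sizemode 'area': twice the largest value divided by the square of the desired maximum marker size in pixels. The helper name area_sizeref, the 40-pixel target, and the sample counts are all assumptions made for illustration.

```python
def area_sizeref(values, max_marker_px=40):
    """Return a sizeref so the largest value maps to roughly max_marker_px."""
    largest = max(values)
    return 2.0 * largest / (max_marker_px ** 2)


# Hypothetical active case counts.
active_cases = [450000, 120000, 900]

sizeref = area_sizeref(active_cases)
print(sizeref)
```

With this approach, the unscaled Active column would be passed as the marker size, together with sizemode='area' and sizeref=sizeref in the marker dictionary, instead of multiplying by a hand-picked constant. The value should be computed once from the whole DataFrame so that all nine traces share the same scale.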

If you are confused by the parameters in the above code snippet, check out this documentation.



fig.update_layout(
        title_text = 'Active Covid-19 Cases In The United States By Geography',
        title_x=0.5,
        showlegend = True,
        legend_title = 'Range Of Active Cases',
        geo = dict(
            scope = 'usa',
            landcolor = 'rgb(217, 217, 217)',
            projection=go.layout.geo.Projection(type = 'albers usa'),
        )
    )



Next, we focus on the aesthetics of our bubble map visualization. We set the title of the bubble map and its position (title_x=0.5 means center aligned) and the title of the legend. Since we are making a bubble map about the US COVID-19 Active cases, we specify the bubble map scope as 'usa'. For aesthetics, I changed the US landmass colour to grey using the 'landcolor' parameter.

If you have any queries about this code snippet, this plotly documentation will help you!

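Finally, we save our graph on our local machine and then display it in our Jupyter notebook. Note that write_image depends on an additional image-export package (kaleido in recent Plotly versions), so you may need to install it first.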



fig.write_image("Active-Covid19-Cases-US-bubblemap.png", scale=2)
fig.show()


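If static image export is not available on your machine, saving an interactive HTML file is a reasonable fallback. The helper below is a minimal sketch: the name save_figure is hypothetical, and catching a broad Exception is a simplification, since the exact error raised when the export package is missing may differ between Plotly versions.

```python
def save_figure(fig, stem):
    """Try to save a PNG; fall back to interactive HTML if that fails.

    `fig` is expected to offer write_image and write_html, as Plotly figures do.
    Returns the name of the file that was written.
    """
    try:
        fig.write_image(f"{stem}.png", scale=2)
        return f"{stem}.png"
    except Exception:
        # Assumption: any failure here means static export is unavailable.
        fig.write_html(f"{stem}.html")
        return f"{stem}.html"
```

It could be called as save_figure(fig, "Active-Covid19-Cases-US-bubblemap") in place of the write_image line. The HTML file also keeps the hover text and the clickable legend, which a PNG cannot.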

And our bubble map is ready!

[Image: US Counties COVID-19 Bubble Map]

Conclusion

You can find the code for this tutorial on my GitHub.

Thanks a lot for reading my tutorial! If you have any questions, feel free to ask me! You can also follow me on Twitter or connect with me on LinkedIn. I would also love to get some feedback on my code and my post!
