Whether you are a data analyst, data scientist, or business analyst, data visualization is your friend when you want to see trends in your data. Without it, patterns are hard to spot in raw tables of numbers. Some of the visuals that you are likely to use in your data analysis work include charts, histograms, line plots, scatter plots, tree maps, heatmaps, and box plots.
Why visualization?
Data visualization is beneficial in many ways:
- It gives stakeholders a clear picture of the data, which helps them understand the insights it holds.
- It helps stakeholders understand the relationships between variables, such as trends, correlations, and connections.
- It makes inaccuracies in the data easier to spot. This helps data scientists prepare the data by finding missing values and outliers before passing the data through machine learning models.
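The data-quality checks that a box plot makes visible can also be sketched in plain Python. This is a minimal illustration rather than part of the workflow described here: the function names `count_missing` and `iqr_outliers` and the sample values are made up for the example, and it assumes the conventional 1.5 × IQR rule for flagging outliers.

```python
import math
import statistics


def _is_missing(value):
    """True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def count_missing(values):
    """Count the entries that are None or NaN."""
    return sum(1 for v in values if _is_missing(v))


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    clean = [v for v in values if not _is_missing(v)]
    q1, _, q3 = statistics.quantiles(clean, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in clean if v < low or v > high]


# Hypothetical sample: one missing value and one suspiciously large reading
sample = [4.1, 3.9, 4.4, None, 4.0, 4.2, 19.5, 4.3]
print(count_missing(sample))   # number of missing entries
print(iqr_outliers(sample))    # values flagged by the 1.5 * IQR rule
```

In this sample the check reports one missing entry and flags 19.5, the same point a box plot would draw beyond its whiskers.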
There are many data visualization tools that you will encounter in your data analysis journey. Some of the tools include Tableau, Looker, Microsoft Excel, Power BI, Google Data Studio, and programming languages like Python and R. For this article, we will talk about the data visualization tools in R and Python.
Python visualization tools
Some of the popular data visualization libraries in Python are Matplotlib, Seaborn, and Plotly.
Matplotlib
Matplotlib is the most widely used visualization library in Python. It is one of the oldest Python plotting libraries, and other libraries such as Seaborn are built on it. It supports graphical representations like bar plots, scatter plots, histograms, area plots, pie charts, and line plots.
Code:
import statsmodels.api as sm
import matplotlib.pyplot as plt
# List the datasets available in statsmodels
print(sm.datasets.__all__)
# Load the state crime dataset that ships with statsmodels as pandas objects
statecrime = sm.datasets.statecrime.load_pandas()
# Assign the statecrime data to crime_df
crime_df = statecrime.data
# Matplotlib scatter plot of murder vs. poverty
plt.figure(figsize=(10, 6))
plt.scatter(crime_df['murder'], crime_df['poverty'])
plt.title('Scatter plot of murder rate vs. poverty rate')
plt.xlabel('Murder rate')
plt.ylabel('Poverty rate')
plt.show()
Code Explanation:
The first step is to import the necessary libraries: statsmodels, which contains the dataset we will use, and Matplotlib, which creates the visualization.
The second step is to list the available datasets and choose one. For this visualization, we use the statecrime dataset. We load the dataset and then assign its data to the crime_df data frame.
Finally, we create the visualization, which in this case is a scatter plot showing the relationship between the ‘murder’ and ‘poverty’ variables. We first set the figure size, then draw the scatter plot, then add the title and the x-axis and y-axis labels, and finally display the plot with plt.show().
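The same steps can also be wrapped in a reusable function using Matplotlib's object-oriented interface (`plt.subplots()` and the `Axes` methods). The sketch below is one possible arrangement, not the only way to do it: the helper name `scatter_figure` and the toy values are assumptions for the example, and the `Agg` backend is selected only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen so no display is required
import matplotlib.pyplot as plt


def scatter_figure(x, y, x_label, y_label, title):
    """Build a labelled scatter plot and return the Figure and Axes."""
    fig, ax = plt.subplots(figsize=(10, 6))  # set the figure size
    ax.scatter(x, y)                         # draw the points
    ax.set_title(title)                      # add the title
    ax.set_xlabel(x_label)                   # label the x-axis
    ax.set_ylabel(y_label)                   # label the y-axis
    return fig, ax


# Made-up values standing in for two columns of a data frame
fig, ax = scatter_figure(
    x=[1.2, 3.4, 5.1, 7.8],
    y=[9.5, 11.0, 14.2, 16.9],
    x_label="Murder rate",
    y_label="Poverty rate",
    title="Murder rate vs. poverty rate (toy data)",
)
```

Returning the figure instead of showing it lets the caller decide what to do next, for example `fig.savefig("scatter.png")` in a script or `plt.show()` in an interactive session without the `Agg` line.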
Plot:
A look at the scatterplot above shows that there is a positive relationship between poverty and murder rates.
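A visual impression like this can be backed up with a number. The sketch below computes the Pearson correlation coefficient by hand. The function name `pearson_r` and the toy values are assumptions for illustration and are not taken from the statecrime data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    xs, ys = list(xs), list(ys)
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-deviations and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up numbers purely to show the call
toy_x = [1, 2, 3, 4, 5]
toy_y = [2, 4, 5, 4, 5]
print(round(pearson_r(toy_x, toy_y), 3))  # prints 0.775
```

A value close to +1 indicates a strong positive relationship, a value close to -1 a strong negative one, and a value near 0 little linear relationship. With the data frame from the example above, the call would be `pearson_r(crime_df['murder'], crime_df['poverty'])`, and pandas offers the same calculation as `crime_df['murder'].corr(crime_df['poverty'])`.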
Seaborn
Seaborn is a visualization library for generating statistical graphics. It is built on Matplotlib and offers a higher-level interface, so common statistical plots take less code than in plain Matplotlib. When you want to understand how variables in a dataset are related, statistical plots show you the trends and patterns in the dataset. Seaborn can produce line, scatter, point, count, violin, KDE, bar, swarm, and box plots. Because Seaborn was created to work with Matplotlib, you can still adjust Seaborn plots with Matplotlib functions. See the Seaborn documentation for more plot types.
Code:
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
# Listing the available datasets in statsmodels
print(sm.datasets.__all__)
# Load the state crime dataset that ships with statsmodels as pandas objects
statecrime = sm.datasets.statecrime.load_pandas()
# Assign the statecrime data to crime_df
crime_df = statecrime.data
# Create the plot
plt.figure(figsize=(10,6))
sns.scatterplot(x='single', y='violent', data=crime_df)
plt.title('Scatterplot of single people and violence')
plt.xlabel('Single')
plt.ylabel('Violence')
plt.show()
Code explanation:
The first step, as usual, is importing the libraries, which are statsmodels, Matplotlib, and Seaborn. We also load the dataset from statsmodels and assign it to the crime_df data frame.
The next step is using Seaborn to create the scatter plot that shows the relationship between the ‘single’ and ‘violent’ variables. In the statecrime dataset, ‘single’ describes family households headed by a single adult rather than single people in general, and ‘violent’ is the violent crime rate. The scatter plot is created using the ‘sns.scatterplot()’ function, which takes the x-axis and y-axis column names and the dataset as parameters. Next, the labels are added: the title, the x-axis label, and the y-axis label.
Finally, ‘plt.show()’ is called, which displays the scatter plot.
The plot:
The plot shows that the violent crime rate and the ‘single’ variable are positively correlated.
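A fitted trend line can make a relationship like this easier to judge. As a hedged sketch, Seaborn's `regplot()` draws the scatter points together with a least-squares regression line. The column names mirror the example above, but the values in `toy_df` are made up, and the `Agg` backend is chosen only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen so no display is required
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up values standing in for two columns of crime_df
toy_df = pd.DataFrame({
    "single": [22.5, 25.1, 27.8, 30.2, 33.0, 35.4],
    "violent": [210.0, 260.0, 330.0, 390.0, 450.0, 540.0],
})

fig, ax = plt.subplots(figsize=(10, 6))
# regplot draws the points plus a fitted regression line;
# ci=None skips the bootstrapped confidence band around the line
sns.regplot(data=toy_df, x="single", y="violent", ci=None, ax=ax)
ax.set_title("Toy data with a fitted trend line")
```

Passing `data=crime_df` instead of `toy_df` would apply the same call to the statecrime columns. An upward-sloping line corresponds to a positive correlation.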
Plotly
Plotly is another visualization library, and it produces interactive plots. With these plots, you can zoom in and out to get a clearer picture of the relationship between variables or the distribution of a variable. Some of the benefits of using Plotly include the ability to inspect outliers using the hover tool and extensive customization options that make the plots more understandable. The Plotly documentation has many interactive examples.
Code:
import statsmodels.api as sm
import plotly.express as px
# Listing the available datasets in statsmodels
print(sm.datasets.__all__)
# Load the state crime dataset that ships with statsmodels as pandas objects
statecrime = sm.datasets.statecrime.load_pandas()
# Assign the statecrime data to crime_df
crime_df = statecrime.data
# Create the plot
fig = px.scatter(crime_df, x='hs_grad', y='poverty', size='violent',
                 hover_name=crime_df.index,
                 title='Scatter plot of high school graduation vs poverty')
fig.show()
Code explanation:
The first step is importing the relevant libraries and loading the dataset.
The next step is creating the scatter plot. Here, Plotly Express (px) creates the interactive scatter plot of high school graduation and poverty. The ‘px.scatter()’ function takes parameters such as the dataset, the x-axis and y-axis columns, and the title. The ‘size’ parameter scales each marker by the violent crime rate, and ‘hover_name’ sets the label shown when you hover over a point, which here is the data frame index.
The final part, ‘fig.show()’, displays the interactive plot.
The plot:
The plot above shows that poverty is negatively related to high school graduation. That is, areas with high poverty rates tend to have low graduation rates.
R Visualization tools
The most commonly used R visualization libraries are ggplot2 and plotly.
ggplot2
The ggplot2 package is built on the grammar of graphics. It is used to create visualizations like error bar charts, scatter plots, histograms, pie charts, and bar charts. With ggplot2, you build a plot in layers, adding geometries, labels, and other aesthetics as your needs require.
Code:
library(tidyverse)
arrests <- USArrests
ggplot(arrests, aes(x = Assault, y = Murder, label = rownames(arrests))) +
  geom_point(color = "darkred") +
  labs(title = "Murder vs. Assaults per region", x = "Assaults", y = "Murders") +
  geom_text(nudge_x = 0.5, check_overlap = TRUE)
Code Explanation:
The first step is loading the library and then the data. The library used for this project is ‘tidyverse’, which includes ggplot2, and the dataset is a built-in R dataset named ‘USArrests’, which contains data on arrests made in the US.
The second step is setting up the plot. We call ‘ggplot()’, a function from the ggplot2 package that ‘tidyverse’ loads, and specify the dataset ‘arrests’ along with the x-axis variable ‘Assault’ and the y-axis variable ‘Murder’. The ‘label = rownames(arrests)’ mapping supplies the state name as the label for each point. On its own, this call only sets up the axes; the layers added next draw the points and the labels.
The third step is adding the points and labels. The ‘geom_point(color = "darkred")’ layer draws the points in dark red, while ‘geom_text(nudge_x = 0.5, check_overlap = TRUE)’ adds the text labels, nudges them slightly to the right of the points, and skips any label that would overlap one already drawn.
The remaining code, ‘labs()’, labels the scatter plot. It gives the plot a title, an x-axis label, and a y-axis label.
Below is the scatterplot:
In this plot, murders and assaults are positively correlated. States near 0 on both the x and y axes have low murder and assault rates, while those towards the top right have higher murder and assault rates.
As explained above, each point is colored dark red, and the text labels name the state that a point represents. The beauty of using ggplot2 is that it allows for this kind of customization of the plot.
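For comparison, a similar labelled scatter plot can be sketched in Python with Matplotlib, where `ax.annotate()` plays the role of `geom_text()`. This is a rough analogue rather than an exact port: the five rows below were typed in by hand from the start of USArrests for illustration and should be checked against the dataset before being relied on, and the `Agg` backend is selected only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen so no display is required
import matplotlib.pyplot as plt

# Illustrative rows typed in by hand: (state, Assault, Murder)
rows = [
    ("Alabama", 236, 13.2),
    ("Alaska", 263, 10.0),
    ("Arizona", 294, 8.1),
    ("Arkansas", 190, 8.8),
    ("California", 276, 9.0),
]

fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter([r[1] for r in rows], [r[2] for r in rows], color="darkred")

# Label each point, shifted 5 points to the right (similar in spirit to nudge_x)
for state, assault, murder in rows:
    ax.annotate(state, (assault, murder), xytext=(5, 0), textcoords="offset points")

ax.set_title("Murder vs. Assaults (first rows of USArrests)")
ax.set_xlabel("Assaults")
ax.set_ylabel("Murders")
```

Unlike `check_overlap = TRUE` in ggplot2, this loop draws every label even when labels collide, so a crowded plot would need extra handling.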
Plotly
Plotly for R offers the same functionality as Plotly for Python: apart from producing a wide array of plots, the library also makes them interactive.
Code:
library(plotly)
arrests <- USArrests
plot_ly(
  data = arrests,
  x = ~Murder,
  y = ~Assault,
  text = ~paste("State:", rownames(arrests), "<br>Murders:", Murder, "<br>Assault:", Assault),
  type = "scatter",
  mode = "markers",
  marker = list(size = 10, color = "darkblue", opacity = 0.7)
) %>%
  layout(title = "Crimes per state",
         xaxis = list(title = "Murder"),
         yaxis = list(title = "Assault"),
         hovermode = "closest")
Code Explanation:
The first step is loading the plotly library and the data. The plotly library allows for the plotting of interactive visualizations.
The next step is creating the interactive scatter plot. The ‘plot_ly()’ function creates the plot, and in it we specify the dataset (arrests) and the x-axis and y-axis variables, which are ‘Murder’ and ‘Assault’ respectively. The ‘text’ parameter within ‘plot_ly()’ sets the information displayed when you hover over the points in the plot.
The third step is customizing the plot's appearance. The ‘mode’ parameter specifies that the scatter plot be displayed with markers, while ‘marker’ controls how the markers appear.
The last step is setting the layout options, which customize the appearance of the plot. The title and the x-axis and y-axis labels are the texts that will be displayed, while the hovermode setting ‘closest’ shows the hover information for the point nearest to the cursor.
Plotly scatter plot
The plot above shows that assaults and murder rates are positively correlated. If you hover the cursor in the plot after running the code above on your computer, you will see that states with small murder and assault rates are closer to 0 in both the x and y axes. If you progressively move towards the top-right of the plot, you will see that the murder and assault rates are increasing.
Hovering over the points in the image above will not display any information because it is a static ‘.png’ picture. However, if you run the code above, you will get an interactive visualization in RStudio.
Choosing the right visualization tool
Picking the ideal tool for your visualization needs can be challenging since both R and Python have robust libraries that will support your visualization needs. Let us look at the strengths and weaknesses of each language for data visualization:
- R was designed for statistical data analysis, and as such, it has libraries that create polished visualizations. In particular, the ggplot2 package is known for its aesthetics and its extensive range of chart types, which makes R a strong choice for data visualization.
- Python’s versatility as a language extends to its powerful visualization libraries, Matplotlib and Seaborn. Even though these libraries can be more confusing to use than ggplot2, Python itself is relatively easy for beginners to learn. Also, Python’s extensive data science libraries make it ideal for projects that go beyond visualization.
Ultimately, the choice of the tool to use for your visualization project will depend on your needs. If you want to perform statistical data analysis and visualization, then R will be a good choice for you. However, if you are interested in data science applications and you want to create your visuals after a short learning period, then Python fits your needs.
Additional resources for your visualization needs
- https://www.analyticsvidhya.com/blog/2022/03/a-comprehensive-guide-on-ggplot2-in-r/
- https://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html
- https://www.geeksforgeeks.org/matplotlib-tutorial/
- https://www.geeksforgeeks.org/python-seaborn-tutorial/
- https://plotly.com/python/plotly-express/
- https://www.geeksforgeeks.org/interactive-charts-using-plotly-in-r/