Introduction
With vast amounts of data generated daily, making sense of it all can be challenging. Data analytics helps you uncover the insights hidden in your data. As a data scientist or data analyst, libraries will be your greatest companions because they simplify the data science process. You will encounter several data science libraries on your data analytics journey, and below are some of the most common ones. Let’s have a look at the libraries you will need as a beginner!
Popular Python data science libraries
Data wrangling libraries
a) Pandas
Pandas is your friend when it comes to manipulating data in table format. In data science, we call a table of data a data frame. A data frame comprises rows and columns, and you can manipulate it with operations like merge, join, concatenate, and groupby. You can use Pandas to clean, explore, analyze, and manipulate data, whether it is stored in databases or spreadsheets.
Here’s an example of how Pandas works:
import pandas as pd
#let us create a dataframe
data = {'Name': ['John', 'Alice', 'Bob', 'Jane'],
        'Age': [25, 30, 35, 40],
        'Education': ['Bsc', 'Masters', 'College', 'highschool'],
        'Salary': [50000, 60000, 70000, 80000]}
df = pd.DataFrame(data)
#now we have a dataset/dataframe
# let us now filter the high paying employees
high_salary_employees = df[df['Salary'] > 60000]
print(high_salary_employees)
In the code above, Pandas was used to create a dataframe and then filter it to meet a specific condition, which in this case was to output the employees earning more than 60000.
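Pandas can also combine and aggregate tables, as mentioned above. Here is a minimal sketch, using a second, made-up dataframe for illustration, that merges department information into the dataframe created earlier and computes the average salary per department:
#a second dataframe with hypothetical department information
departments = pd.DataFrame({'Name': ['John', 'Alice', 'Bob', 'Jane'],
                            'Department': ['IT', 'Finance', 'HR', 'IT']})
#merge the two dataframes on the shared Name column
merged = pd.merge(df, departments, on='Name')
#group by department and compute the average salary per group
print(merged.groupby('Department')['Salary'].mean())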
b) NumPy
When it comes to manipulating data in array format, NumPy is your best friend. This library is useful when dealing with multidimensional matrices and large arrays. Some basic operations that you can perform using NumPy include multiplying, slicing, flattening, indexing, reshaping, and adding arrays.
Here’s an example of a code using NumPy:
import numpy as np
#create the array
a = np.array([[1,4,6],[3,5,7]], dtype=int)
print('The array created is:\n', a)
The array created is:
[[1 4 6]
[3 5 7]]
In the code above, we created a 2-dimensional NumPy array; the lines printed after the code are its output.
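Building on the array above, here is a small sketch of a few of the operations mentioned earlier, namely reshaping, slicing, and element-wise multiplication (it assumes the array a created above):
#reshape the 2x3 array into a 3x2 array
b = a.reshape(3, 2)
print('Reshaped array:\n', b)
#slice out the first row of the original array
print('First row:', a[0])
#multiply every element of the array by 2
print('Doubled:\n', a * 2)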
Data visualization libraries
a) Matplotlib
Visualization is the easiest way to see patterns, trends, and relationships between the variables in your data. With Matplotlib, you can create plots like histograms, line plots, scatter plots, pie charts, and bar charts, among others. Matplotlib is also highly customizable, so a little extra code is enough to tailor plots to your needs.
Here’s a small code showing Matplotlib visualization:
import matplotlib.pyplot as plt
x = [5,10,15,20,25]
y = [30,35,40,45,50]
plt.plot(x, y)
plt.title('A plot of X against Y')
plt.xlabel('X')
plt.ylabel('Y')
plt.show()
The output is a typical line plot of X against Y. Apart from drawing the plot itself, matplotlib also allows you to label the X and Y axes and add a title to the plot.
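To illustrate that customizability, here is a small sketch (reusing the same x and y lists) where a little extra code changes the line’s color and style and adds markers and a grid:
import matplotlib.pyplot as plt
x = [5,10,15,20,25]
y = [30,35,40,45,50]
#customize the line color and style and mark each data point
plt.plot(x, y, color='green', linestyle='--', marker='o')
plt.title('A customized plot of X against Y')
plt.xlabel('X')
plt.ylabel('Y')
plt.grid(True)
plt.show()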
b) Seaborn
This is another visualization library, and it integrates with both Pandas and NumPy. Seaborn has plotting functions that operate directly on arrays and data frames. With Seaborn, you can perform statistical aggregations to create informative plots according to user needs. The data graphics that come with Seaborn include scatter plots, line plots, histograms, and heatmaps.
Here’s an example of how Seaborn works:
import seaborn as sns
import matplotlib.pyplot as plt
x = [5,10,15,20,25]
y = [30,35,40,45,50]
sns.lineplot(x=x, y=y)
plt.title('A plot of X against Y')
plt.xlabel('X')
plt.ylabel('Y')
plt.show()
The output is essentially the same line plot as before. Seaborn serves the same function as matplotlib, and the two use almost the same syntax. For instance, in the plot above, the axis labels and the title were added using matplotlib, but the plot itself was drawn using seaborn.
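Seaborn’s statistical aggregation is clearer with a data frame. The sketch below uses Seaborn’s built-in tips example dataset (downloaded the first time it is loaded) and lets barplot compute the average bill per day automatically:
import seaborn as sns
import matplotlib.pyplot as plt
#load one of seaborn's built-in example datasets
tips = sns.load_dataset('tips')
#barplot aggregates the data, showing the mean total bill per day
sns.barplot(data=tips, x='day', y='total_bill')
plt.title('Average total bill per day')
plt.show()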
c) Plotly
If you want to take your visualization game to another level, then Plotly is your tool of choice. Plotly offers interactive visualizations and supports many chart types, including histograms, sparklines, scatter plots, line charts, box plots, contour plots, and bar charts.
Here’s an example of how Plotly works:
import plotly.express as px
#using the inbuilt iris flower dataset
flower = px.data.iris()
#plotting a bar chart
plot = px.bar(flower, x="petal_width", y="petal_length")
# showing the plot
plot.show()
The output is a bar chart of petal length against petal width. Just like matplotlib and seaborn, plotly produces good plots; in the one above, plotly did all the work from plotting to labelling the axes.
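The interactivity is easier to appreciate with a richer plot. Here is another small sketch on the same iris dataset, a scatter plot colored by species, where hovering over a point reveals its values:
import plotly.express as px
flower = px.data.iris()
#scatter plot with one color per species; hovering over a point shows its values
plot = px.scatter(flower, x="sepal_width", y="sepal_length", color="species")
plot.show()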
Machine Learning Libraries
a) Scikit-Learn (Sklearn)
Sklearn is an important tool for your machine learning needs, whether you are a beginner or an expert. Built on top of SciPy, NumPy, and Matplotlib, this library performs machine learning tasks efficiently. Sklearn contains both supervised and unsupervised machine learning algorithms, including linear and logistic regression, support vector machines, clustering, naive Bayes, random forests, nearest neighbors, and decision trees.
Here’s an example of how Scikit-learn works:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# Loading the inbuilt Iris dataset
iris = load_iris()
# Creating a dataframe from the dataset
iris = pd.DataFrame(data=iris.data, columns=iris.feature_names)
# splitting the features from the target variable
X = iris.drop(columns = ['petal width (cm)']) # specifying the feature variable
y = iris['petal width (cm)'] # specifying the target variable
# Split the iris dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Instantiating and fitting the linear regression model to the training data
LR = LinearRegression()
LR.fit(X_train, y_train)
# Making predictions on the test set
y_pred = LR.predict(X_test)
# Calculating the coefficient of determination (R^2) to evaluate the model
r_squared = LR.score(X_test, y_test)
print("Coefficient of determination (R^2):", r_squared)
# Print the coefficients
print("Coefficients:", LR.coef_)
# Plotting the actual vs. predicted values
plt.scatter(y_test, y_pred, color='blue')
plt.title('Actual vs. Predicted Petal Width')
plt.xlabel('Actual Petal Width (cm)')
plt.ylabel('Predicted Petal Width (cm)')
plt.show()
The output:
Coefficient of determination (R^2): 0.9407619505985545
Coefficients: [-0.25113971  0.25971605  0.54009078]
In the example above, we used the linear regression model found in the Scikit-learn library. The output shows how the features influence the target variable. A coefficient of determination (R^2) of 0.941 means that 94.1% of the variance in petal width is explained by the independent variables (features): sepal length, sepal width, and petal length. In a linear regression model, a higher R-squared value is good, as it shows the model fits the data better.
The coefficients show the weight of each independent variable. In this model, there were 3 independent variables, and their coefficients are -0.25113971, 0.25971605, and 0.54009078, which belong to sepal length, sepal width, and petal length respectively.
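Once fitted, the model can also score new measurements. Here is a minimal follow-up sketch (it assumes the LR model fitted above; the measurements of the new flower are made up for illustration):
#a hypothetical new flower: sepal length, sepal width, petal length (cm)
new_flower = pd.DataFrame([[5.1, 3.5, 1.4]],
                          columns=['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)'])
#predict its petal width with the fitted model
print("Predicted petal width (cm):", LR.predict(new_flower)[0])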
Popular R data science libraries
Data wrangling libraries
a) Dplyr
This library is for the transformation and manipulation of data. With it, you can perform basic data operations like selecting, mutating, summarizing, joining, filtering, and grouping data frames. You can use Dplyr to clean and wrangle data.
Here’s an example of how Dplyr works:
#loading libraries
library(dplyr)
#load the inbuilt cars dataset
data(mtcars) #this is an inbuilt dataset on cars
select_data <- select(mtcars, hp, mpg, disp)
head(select_data)
The output shows the first six rows of the selected columns. In the example above, we used dplyr to select the hp, mpg, and disp features from the dataset. Generally, the code extracts a subset of the relevant data so we can focus on specific aspects of it; extracting a few features makes it easier to create visualizations and to analyze a smaller dataset.
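dplyr’s other verbs chain together naturally with the pipe operator. Here is a minimal sketch on the same mtcars data that filters, mutates, groups, and summarizes (the power_to_weight column is made up for illustration):
#keep powerful cars, add a power-to-weight ratio, and average mpg per cylinder count
mtcars %>%
  filter(hp > 100) %>%
  mutate(power_to_weight = hp / wt) %>%
  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg))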
b) Lubridate
If you have date-related variables to transform, then Lubridate is your friend. Without this library, working with date-time data would be frustrating. Lubridate has functions like hour(), month(), minute(), year(), and second(). You can use this library to accurately calculate durations, intervals, and age, among other time-related measures. It will greatly help you wrangle and clean time-related data.
Here’s an example of how Lubridate works:
#loading the necessary library
library(lubridate)
#creating a vector of dates
dates <- as.Date(c("2024-01-01", "2024-04-01",
                   "2024-02-19", "2024-03-09",
                   "2024-03-25"))
year <- year(dates)
month <- month(dates)
day <- day(dates)
print(data.frame(dates, year, month, day))
The output is a data frame listing each date alongside its year, month, and day. The code takes date strings and converts them into usable date objects, then separates each date into components (year, month, and day) that are easier to work with in further analysis. As you can see from the output, extracting the date components gives the data more structure.
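Lubridate also makes duration and interval arithmetic straightforward. A small sketch, reusing the dates vector created above:
#the interval between the first and last date
span <- interval(dates[1], dates[5])
#express the interval as a duration
as.duration(span)
#or simply count the days between the two dates
dates[5] - dates[1]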
Data visualization libraries
a) Ggplot2
Ggplot2 is the most popular visualization library in R, and it is based on the grammar of graphics. By mapping data attributes to visual elements like colors and lines, ggplot2 creates informative plots. The library also supports themes, facets, layers, and scales, which gives you control over the layout and appearance of your plots.
Here’s an example of how Ggplot2 works:
#loading the library
library(ggplot2)
#loading the dataset
data(mtcars)
#creating a scatterplot of mpg and hp
ggplot(data=mtcars, aes(x=hp, y=mpg))+
  geom_point()+
  labs(title="Scatterplot of mpg vs hp",
       x="hp",
       y="mpg")
The output is a scatter plot created using the ggplot2 library. The plot visualizes the relationship between fuel efficiency (mpg) and engine power (hp) for the vehicles in the mtcars dataset.
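A few extra layers show the control ggplot2 gives you. The sketch below extends the same plot by coloring the points by cylinder count and applying one of the built-in themes:
#color points by number of cylinders and use a minimal theme
ggplot(data=mtcars, aes(x=hp, y=mpg, color=factor(cyl)))+
  geom_point()+
  theme_minimal()+
  labs(title="Scatterplot of mpg vs hp by cylinders",
       x="hp",
       y="mpg",
       color="Cylinders")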
b) Plotly
Just like in Python, Plotly in R is for creating interactive visualizations. This library offers many options for visualizing data, from traditional plots to advanced ones like 3D charts and heat maps. It will come in handy whenever you want to create interactive plots.
Here’s an example of how Plotly works:
#loading the library
library(plotly)
#loading the data
data(mtcars)
#plotting a histogram of miles per gallon (mpg)
plot_ly(data = mtcars, x = ~mpg, type = "histogram",
        marker = list(color = "skyblue", line = list(color = "black", width = 1))) %>%
  layout(title = "Histogram of MPG",
         xaxis = list(title = "MPG"),
         yaxis = list(title = "Frequency"))
The output is an interactive histogram displaying the distribution of the mpg (miles per gallon) feature of the mtcars dataset. The histogram offers insights into how many cars share similar fuel efficiency ratings.
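Plotly’s more advanced chart types use the same interface. As a minimal sketch, here is one of the 3D charts mentioned above, a 3D scatter plot of weight, horsepower, and fuel efficiency from mtcars:
#an interactive 3D scatter plot; drag to rotate, hover for values
plot_ly(data = mtcars, x = ~wt, y = ~hp, z = ~mpg,
        type = "scatter3d", mode = "markers")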
Machine learning libraries
a) Caret
This package is a supervised machine learning library. It is mainly used for classification and regression problems, and its name is short for Classification And REgression Training. Caret has many helpful functions, such as createDataPartition() and trainControl(), which are used for splitting data and configuring cross-validation respectively.
Here’s an example of how Caret works:
#loading the library
library(caret)
#loading the data
data(iris)
#splitting the data into training and test sets
set.seed(42)
train_index <- createDataPartition(iris$Species, p = 0.8)
train_data <- iris[train_index[[1]],]
test_data <- iris[-train_index[[1]],]
#training the decision tree model
model <- train(Species ~., data=train_data, method='rpart')
#making predictions on the test data
predictions <- predict(model, newdata=test_data)
#model evaluation
confusionMatrix(predictions, test_data$Species)
The output is a confusion matrix together with accuracy statistics. The code above demonstrates the use of R’s Caret library for machine learning: we trained a decision tree model to classify the flower species in the iris dataset and then evaluated its performance on the test data with a confusion matrix, which shows how well the model predicted each flower type. From the output, the model has an accuracy of 93.33%, an indication that it performed fairly well.
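The trainControl() function mentioned earlier fits into the same workflow. As a sketch, the model above could be trained with 10-fold cross-validation instead of a single train/test split:
#set up 10-fold cross-validation
ctrl <- trainControl(method = "cv", number = 10)
#retrain the decision tree, resampling with cross-validation
cv_model <- train(Species ~., data = train_data, method = 'rpart', trControl = ctrl)
cv_model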
Conclusion
In this article, you have learned about some of the most popular data science libraries in R and Python. Libraries like dplyr, Plotly, ggplot2, Caret, Scikit-learn, Pandas, and NumPy, among others, play an important role in helping data scientists explore data, manipulate it, and build statistical models. This list is not exhaustive, and you can find other libraries with additional research. Happy learning ahead!