Tina Huynh

Posted on May 5, 2022

10 Exciting Beginner Machine Learning Projects of 2022

#machinelearning #python #beginners #deeplearning

Zillow Home Value Prediction
Article Recommendation System
Iris Flowers Classification
Instagram Reach Analysis and Prediction
BigMart Sales Prediction
Stock Prices Predictor using TimeSeries
Waiter Tips Analysis & Prediction
Music Recommendation System
Covid-19 Deaths Prediction
Stress Detection
Helpful Links

Zillow Home Value Prediction

Zestimate is Zillow's tool for estimating the worth of a house based on attributes drawn from public data, sales data, and so on. It covers more than 97 million homes. A Zestimate is a useful first step when you want to gauge what a house is worth, check whether its value went up after an upgrade, or decide whether to refinance. The algorithm behind Zestimate refreshes its data 3 times a week, using comparable sales and publicly available data.

The goal is to build a model that improves the Zestimate residual error, called the “log error”: the difference between the log of the Zestimate price and the log of the actual sale price.

log error = log(Zestimate) - log(SalePrice)
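As a quick illustration of the formula, here is a minimal sketch with NumPy. The prices are made-up example values, not rows from the Zillow dataset.

```python
import numpy as np

# Hypothetical example values, not taken from the Zillow dataset.
zestimate = np.array([310_000.0, 455_000.0, 199_000.0])
sale_price = np.array([300_000.0, 470_000.0, 199_000.0])

# log error = log(Zestimate) - log(SalePrice)
log_error = np.log(zestimate) - np.log(sale_price)

# Positive values mean the Zestimate was above the sale price, negative below.
print(np.round(log_error, 4))
```

A log error of zero means the Zestimate matched the sale price exactly.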

Machine Learning project Workflow:

1. Import Libraries and Loading Dataset

Here you will be using python, opendatasets, pandas, seaborn, matplotlib, plotly, geopandas, sklearn, etc.

2. Exploratory Data Analysis

Look at missing values
Illustrate distribution and outliers
Analyze

3. Fix and clean the data

You'll find around 35 columns with ~30% missing values. Data cleaning is a critical step in any machine learning project, so decide deliberately whether to drop or fill these columns before modeling.
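One possible way to handle such columns is sketched below. The tiny DataFrame, its column names, and the 30% cut-off are assumptions for illustration only; the real dataset will need its own choices.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame standing in for the real property data.
df = pd.DataFrame({
    "bedrooms": [3, 2, np.nan, 4, 3],
    "pool_area": [np.nan, np.nan, np.nan, 40.0, np.nan],
    "year_built": [1990, 2005, 1978, np.nan, 2011],
})

# Share of missing values per column.
missing_share = df.isnull().mean()
print(missing_share)

# Drop columns above an assumed 30% threshold, fill the rest with the median.
threshold = 0.30
kept = df.loc[:, missing_share <= threshold]
cleaned = kept.fillna(kept.median())
print(cleaned)
```

Median filling is only one option; dropping rows or using a model-based imputer may suit some columns better.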

4. Data splitting

5. Baseline model training

3 models: a hard coded model that only predicts average, Linear Regression, and Decision Tree models.
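The three baselines can be compared with a few lines of scikit-learn. This is a sketch on synthetic data, and using DummyRegressor as the "predict the average" model is my assumption about how to express that baseline.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the real features and log-error target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=500)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)

baselines = {
    "mean": DummyRegressor(strategy="mean"),  # always predicts the training average
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(max_depth=5, random_state=42),
}

rmse = {}
for name, model in baselines.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_val)
    rmse[name] = float(np.sqrt(mean_squared_error(y_val, preds)))
    print(name, round(rmse[name], 3))
```

Any later model should beat the "mean" score to be worth keeping.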

6. Feature engineering & Feature selection

7. Data Pre-processing

8. Robust model Training and Hyperparameter tuning

You can train on scikit-learn's tree-based ensemble models such as Random Forest, Gradient Boosting, and Extra Trees, as well as on models such as LightGBM and CatBoost.
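Hyperparameter tuning for one of these models might look like the sketch below. The data is synthetic and the parameter grid is a deliberately small assumption so that it runs quickly.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared training data.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=300)

# Small illustrative grid; a real search would cover more values.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, 6]}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes, hence the sign
    cv=3,
)
search.fit(X, y)

print(search.best_params_, round(-search.best_score_, 3))
```

The same pattern works with the other scikit-learn ensembles by swapping the estimator and the grid.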

Check out this GitHub here for the full code and explanation.

Forecasting Real Estate Prices using ML: Time Series Modeling | by Andrea Cabello | Python in Plain English

In this blog post, I present the results of my experience working on a time series forecasting project using Python.

python.plainenglish.io

Back to TOC

Article Recommendation System

There are two main types of recommendation systems: collaborative filtering and content-based filtering.
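To make the content-based idea concrete, here is a minimal text-only sketch using TF-IDF and cosine similarity. The article titles are invented, and this is a different, simpler technique than the image-embedding workflow that follows.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical article titles used only for illustration.
articles = [
    "Introduction to machine learning with Python",
    "Deep learning for image classification",
    "Machine learning model evaluation in Python",
    "Baking sourdough bread at home",
]

# Represent each article by its TF-IDF vector and compare all pairs.
tfidf = TfidfVectorizer(stop_words="english")
vectors = tfidf.fit_transform(articles)
similarity = cosine_similarity(vectors)

def recommend(index, top_n=2):
    """Return indices of the articles most similar to articles[index]."""
    ranked = similarity[index].argsort()[::-1]  # most similar first
    return [int(i) for i in ranked if i != index][:top_n]

for i in recommend(0):
    print(articles[i])
```

Collaborative filtering would instead rely on user interaction data, which this sketch does not model.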

Machine Learning project Workflow:

1. Import Libraries and Loading Dataset

You'd use numpy, pandas, gdown, fastai, matplotlib, zipfile, time, google.colab, etc.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import gdown
from fastai.vision import *
from fastai.metrics import accuracy, top_k_accuracy
from annoy import AnnoyIndex
import zipfile
import time
from google.colab import drive
%matplotlib inline

2. Getting images from Google Drive

# get the images
root_path = './'
url = 'https://drive.google.com/uc?id=1j5fCPgh0gnY6v7ChkWlgnnHH6unxuAbb'
output = 'img.zip'
gdown.download(url, output, quiet=False)
with zipfile.ZipFile("img.zip","r") as zip_ref:
    zip_ref.extractall(root_path)

3. Data preparation and cleaning

4. Retrieve image embeddings with FastAI

5. Testing the system

Take a look at thecleverprogrammer for the code and explanation of recommendation systems in ML.

4 Recommendation System Projects with Python - Coders Camp - Medium

Aman Kharwal ・ Feb 7, 2021 ・
Medium

Back to TOC

Iris Flowers Classification

Iris flower classification is a very popular machine learning project. The iris dataset contains three classes of flowers (Versicolor, Setosa, and Virginica), and each sample is described by 4 features: ‘Sepal length’, ‘Sepal width’, ‘Petal length’, and ‘Petal width’. The aim of iris flower classification is to predict the class of a flower from these features.

Download the dataset here

Machine Learning project Workflow:

1. Importing the libraries

You'll be using numpy, matplotlib, seaborn, pandas, and scikit-learn. You can find a source code of the iris flower classification for download here with opencv.

2. Analyze and visualize the dataset

sns.pairplot(df, hue='Class_labels')

# Separate features and target  
data = df.values
X = data[:,0:4]
Y = data[:,4]

# Calculate average of each features for all classes
Y_Data = np.array([np.average(X[:, i][Y==j].astype('float32')) for i in range (X.shape[1])
 for j in (np.unique(Y))])
Y_Data_reshaped = Y_Data.reshape(4, 3)
Y_Data_reshaped = np.swapaxes(Y_Data_reshaped, 0, 1)
X_axis = np.arange(len(columns)-1)
width = 0.25

plt.bar(X_axis, Y_Data_reshaped[0], width, label = 'Setosa')
plt.bar(X_axis+width, Y_Data_reshaped[1], width, label = 'Versicolour')
plt.bar(X_axis+width*2, Y_Data_reshaped[2], width, label = 'Virginica')
plt.xticks(X_axis, columns[:4])
plt.xlabel("Features")
plt.ylabel("Value in cm.")
plt.legend(bbox_to_anchor=(1.3,1))
plt.show()

3. Model training

Here you want to split the whole data into training and testing datasets. The testing dataset will be used to check the accuracy of the model. You feed the training dataset into the algorithm.

4. Model evaluation

Now you predict the classes from the test dataset from the trained model and check the accuracy score of the predicted classes.

5. Testing the model
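Steps 3 to 5 can be sketched as follows. This version loads the copy of the iris data bundled with scikit-learn and uses a support vector classifier; both are my assumptions, since the steps above do not name a specific loader or algorithm.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Bundled copy of the dataset: 150 samples, 4 features, 3 classes.
iris = load_iris()

# Step 3: split the data and train on the training part.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42,
    stratify=iris.target)
clf = SVC()
clf.fit(X_train, y_train)

# Step 4: evaluate on the held-out test part.
accuracy = accuracy_score(y_test, clf.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")

# Step 5: classify a new measurement (sepal length, sepal width,
# petal length, petal width), all in cm.
sample = [[5.1, 3.5, 1.4, 0.2]]
predicted = iris.target_names[clf.predict(sample)[0]]
print(predicted)
```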

Back to TOC

Instagram Reach Analysis and Prediction

Here is a dataset you can use for this project. There's even a paper on this topic found here.

Machine Learning project Workflow:

1. Building the dataset

You'll be using libraries such as pandas, numpy, matplotlib, seaborn, plotly, wordcloud, sklearn, etc.

2. The scraper

Instagram's API has a limit of 60 requests/hour to their backend servers. You'll want a scraper that linearly scans the latest posts of a user, then opens each post to retrieve more granular information related to each image.

3. Dataset analysis

If you are using the dataset provided above (here's the link), then let's start from here.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from sklearn.model_selection import train_test_split
from sklearn.linear_model import PassiveAggressiveRegressor

Run this to check if the dataset contains null values:

data.isnull().sum()

And this should be the output:

Impressions       1
From Home         1
From Hashtags     1
From Explore      1
From Other        1
Saves             1
Comments          1
Shares            1
Likes             1
Profile Visits    1
Follows           1
Caption           1
Hashtags          1
dtype: int64

If any column reports null values, drop the affected rows by running data = data.dropna(). Both steps assume the dataset has already been loaded, which is done as follows:

data = pd.read_csv("Instagram.csv", encoding = 'latin1')
print(data.head())

You'll get something like this...

   Impressions  From Home  From Hashtags  From Explore  From Other  Saves  \
0       3920.0     2586.0         1028.0         619.0        56.0   98.0   
1       5394.0     2727.0         1838.0        1174.0        78.0  194.0   
2       4021.0     2085.0         1188.0           0.0       533.0   41.0   
3       4528.0     2700.0          621.0         932.0        73.0  172.0   
4       2518.0     1704.0          255.0         279.0        37.0   96.0   

   Comments  Shares  Likes  Profile Visits  Follows  \
0       9.0     5.0  162.0            35.0      2.0   
1       7.0    14.0  224.0            48.0     10.0   
2      11.0     1.0  131.0            62.0     12.0   
3      10.0     7.0  213.0            23.0      8.0   
4       5.0     4.0  123.0             8.0      0.0   

                                             Caption  \
0  Here are some of the most important data visua...   
1  Here are some of the best data science project...   
2  Learn how to train a machine learning model an...   
3  Heres how you can write a Python program to d...   
4  Plotting annotations while visualizing your da...   

                                            Hashtags  
0  #finance #money #business #investing #investme...  
1  #healthcare #health #covid #data #datascience ...  
2  #data #datascience #dataanalysis #dataanalytic...  
3  #python #pythonprogramming #pythonprojects #py...  
4  #datavisualization #datascience #data #dataana...

4. Visualizing data

To get different plots, you can run:

plt.figure(figsize=(10, 8))
plt.title("Distribution of Impressions From Hashtags")
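# distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement
sns.histplot(data['From Hashtags'], kde=True)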
plt.show()

and/or

home = data["From Home"].sum()
hashtags = data["From Hashtags"].sum()
explore = data["From Explore"].sum()
other = data["From Other"].sum()

labels = ['From Home','From Hashtags','From Explore','Other']
values = [home, hashtags, explore, other]

fig = px.pie(data, values=values, names=labels, title='Impressions on Instagram Posts From Various Sources', hole=0.5)
fig.show()

and/or

text = " ".join(i for i in data.Caption)
stopwords = set(STOPWORDS)
wordcloud = WordCloud(stopwords=stopwords, background_color="white").generate(text)
plt.style.use('classic')
plt.figure( figsize=(12,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

5. Prediction Model

You'll want to split the data into training and test sets.

x = np.array(data[['Likes', 'Saves', 'Comments', 'Shares', 'Profile Visits', 'Follows']])
y = np.array(data["Impressions"])
xtrain, xtest, ytrain, ytest = train_test_split(x, y, 
                                                test_size=0.2, 
                                                random_state=42)

Then predict the reach of an Instagram post by giving inputs into the ML model.
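A minimal sketch of that last step is shown below, using the PassiveAggressiveRegressor imported earlier. The data here is synthetic and the feature values passed to predict are invented, so the printed numbers say nothing about real Instagram posts.

```python
import numpy as np
from sklearn.linear_model import PassiveAggressiveRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the columns
# Likes, Saves, Comments, Shares, Profile Visits, Follows.
rng = np.random.default_rng(42)
x = rng.integers(0, 300, size=(200, 6)).astype(float)
y = 10.0 * x[:, 0] + 5.0 * x[:, 1] + rng.normal(scale=50.0, size=200)

xtrain, xtest, ytrain, ytest = train_test_split(
    x, y, test_size=0.2, random_state=42)

model = PassiveAggressiveRegressor(max_iter=1000, random_state=42)
model.fit(xtrain, ytrain)

# R^2 on the held-out data.
print(model.score(xtest, ytest))

# Hypothetical post: Likes, Saves, Comments, Shares, Profile Visits, Follows.
features = np.array([[282.0, 233.0, 4.0, 9.0, 165.0, 54.0]])
prediction = model.predict(features)
print(prediction)
```

With the real dataset, x and y come from the split shown above instead of the random generator.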

Check thecleverprogrammer for the full code.

Back to TOC

BigMart Sales Prediction

Dataset for the project

Machine Learning project Workflow:

1. Exploratory data analysis (EDA)

Distribution of target variables
Numerical predictors
Categorical predictors
Distribution of variables
Bivariate analysis

2. Data Pre-processing

Looking for missing values
Imputing missing values
Normalization of dataset for improved results

3. Feature engineering

Creating broad categories
Modifying categories

4. Building a model

fit(x, y)
predict(x)
test_size=0.2
n_estimators=50
learning_rate = 0.1
random_state = default
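The settings listed above look like those of a boosted tree ensemble. The sketch below plugs them into scikit-learn's GradientBoostingRegressor, which is my assumption about the estimator. The data is synthetic, and random_state is fixed here only to make the run reproducible.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for two predictors (e.g. item price, outlet age)
# and a sales target.
rng = np.random.default_rng(0)
x = np.column_stack([
    rng.uniform(30, 270, size=400),
    rng.integers(5, 35, size=400),
])
y = 15.0 * x[:, 0] + rng.normal(scale=200.0, size=400)

xtrain, xtest, ytrain, ytest = train_test_split(
    x, y, test_size=0.2, random_state=0)

model = GradientBoostingRegressor(
    n_estimators=50, learning_rate=0.1, random_state=0)
model.fit(xtrain, ytrain)

r2 = r2_score(ytest, model.predict(xtest))
print(round(r2, 3))
```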

BigMart Outlet Sales Prediction. The data scientists at BigMart have… | by Precious Kolawole | Medium

Precious Kolawole ・ Nov 10, 2022 ・
precillieo.Medium

Back to TOC

Stock Prices Predictor using TimeSeries

A simple moving average (SMA) predicts the present data point as the mean of the n points before it:

SMA = (P1 + P2 + ... + Pn) / n

where P1 to Pn are the n immediate data points that occur before the present, so to predict the present data point, we take the SMA of size n (meaning that we look at up to n data points in the past).

An exponential moving average (EMA) gives more weight to recent points:

EMA(t) = k * Pt + (1 - k) * EMA(t-1)

where Pt is the price at time t and k is the weight given to that data point. EMA(t-1) represents the value computed from the past t-1 points. Because recent prices count for more, the EMA usually reacts to changes faster than a simple MA. The weight k is computed as k = 2/(N+1).

Two common error measures are the root mean squared error (RMSE) and the mean absolute percentage error (MAPE):

RMSE = sqrt( (1/N) * sum over t of (At - Ft)^2 )

MAPE = (1/N) * sum over t of |(At - Ft) / At| * 100%

Looking closely at the formula of RMSE, we can see that it considers the difference (or error) between the actual (At) and predicted (Ft) price values for all N timestamps and gives an absolute measure of error.

On the other hand, MAPE looks at the error relative to the true value: it measures how far off the predicted values are from the truth in relative terms instead of considering the actual difference. This is a good measure to keep the error ranges in check if we deal with very large or very small values. For instance, RMSE for values in the range of 10e6 might blow out of proportion, whereas MAPE stays on a percentage scale.
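The quantities described above can be computed in a few lines of NumPy. This is a sketch on a made-up price series; seeding the EMA with the first price is a common convention that I am assuming here.

```python
import numpy as np

def sma_forecast(prices, n):
    """Predict each point as the mean of the n points before it."""
    prices = np.asarray(prices, dtype=float)
    return np.array([prices[t - n:t].mean() for t in range(n, len(prices))])

def ema(prices, n):
    """Exponential moving average with weight k = 2 / (n + 1)."""
    prices = np.asarray(prices, dtype=float)
    k = 2.0 / (n + 1)
    out = np.empty_like(prices)
    out[0] = prices[0]  # assumption: seed the recursion with the first price
    for t in range(1, len(prices)):
        out[t] = k * prices[t] + (1 - k) * out[t - 1]
    return out

def rmse(actual, forecast):
    actual, forecast = np.asarray(actual), np.asarray(forecast)
    return float(np.sqrt(np.mean((actual - forecast) ** 2)))

def mape(actual, forecast):
    actual, forecast = np.asarray(actual), np.asarray(forecast)
    return float(np.mean(np.abs((actual - forecast) / actual)) * 100)

# Hypothetical closing prices.
prices = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
pred = sma_forecast(prices, 3)   # forecasts for the last three points
actual = np.array(prices[3:])

print("SMA forecasts:", pred)
print("EMA:", ema(prices, 3))
print("RMSE:", rmse(actual, pred))
print("MAPE (%):", round(mape(actual, pred), 2))
```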

Download stock data from Yahoo

Machine Learning project Workflow:

1. Loading the datasets and libraries

You'll be using pandas, matplotlib, datetime, numpy, sklearn, etc.

2. Data Preprocessing

You'll have 757 data samples in the dataset. An LSTM model requires a window (or timestep) of data in each training step. For example, a window of the previous 10 data samples is used to predict the next one.
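Building those windows might look like the sketch below. The series is a placeholder of 757 increasing numbers rather than real prices, and the helper name make_windows is hypothetical.

```python
import numpy as np

def make_windows(series, window=10):
    """Turn a 1-D series into (samples, window, 1) inputs and next-step targets."""
    series = np.asarray(series, dtype=float)
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]  # the value that follows each window
    # LSTM layers expect a 3-D input: (samples, timesteps, features).
    return X.reshape(-1, window, 1), y

# Placeholder series with the same length as the dataset described above.
series = np.arange(757, dtype=float)
X, y = make_windows(series, window=10)

print(X.shape, y.shape)
```

Each row of X holds 10 consecutive values, and the matching entry of y is the value right after them.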

3. Train and test sets

Here, you want to split the data into training and testing sets.

4. Building the LSTM model

5. Performance Evaluation on test set

To try for better results with the same dataset, you can add another LSTM layer and increase the number of LSTM units per layer.

Check projectpro.io for the full code and explanation.

Time-Series Forecasting: Predicting Stock Prices Using An LSTM Model | by Serafeim Loukas, PhD | Towards Data Science

In this post I show you how to predict stock prices using a forecasting LSTM model

towardsdatascience.com

Back to TOC

Waiter Tips Analysis & Prediction

Tipping waiters for serving food depends on many factors, such as the type of restaurant, how many people you are with, and the size of your bill. Waiter tips analysis is a popular data science case study in which we predict the tip given to a waiter for serving food in a restaurant.

Download the dataset here

Machine Learning project Workflow:

1. Import libraries and dataset

import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

data = pd.read_csv("tips.csv")
print(data.head())

2. Data Analysis

figure = px.scatter(data_frame = data, x="total_bill",
                    y="tip", size="size", color= "day", trendline="ols")
figure.show()

figure = px.pie(data, values='tip', names='day',hole = 0.5)
figure.show()

3. Prediction Model

You'll want to format your data first:

data["sex"] = data["sex"].map({"Female": 0, "Male": 1})
data["smoker"] = data["smoker"].map({"No": 0, "Yes": 1})
data["day"] = data["day"].map({"Thur": 0, "Fri": 1, "Sat": 2, "Sun": 3})
data["time"] = data["time"].map({"Lunch": 0, "Dinner": 1})
data.head()

Then split your data into training and test sets:

x = np.array(data[["total_bill", "sex", "smoker", "day", "time", "size"]])
y = np.array(data["tip"])

from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x, y, 
                                                test_size=0.2, 
                                                random_state=42)

4. Training the model

You can use LinearRegression from sklearn here:

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(xtrain, ytrain)
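Once the model is trained, predicting a tip is a single predict call. The sketch below is self-contained, so it uses a handful of invented rows in the same encoded layout as above instead of tips.csv.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Invented rows in the encoded layout used above (not read from tips.csv).
data = pd.DataFrame({
    "total_bill": [12.5, 18.0, 25.4, 31.2, 9.8, 40.1, 22.3, 15.7],
    "sex":        [0, 1, 1, 0, 0, 1, 1, 0],
    "smoker":     [0, 0, 1, 0, 1, 0, 0, 1],
    "day":        [0, 1, 2, 3, 0, 2, 3, 1],
    "time":       [0, 1, 1, 1, 0, 1, 1, 0],
    "size":       [2, 2, 3, 4, 1, 5, 3, 2],
    "tip":        [1.8, 2.9, 3.7, 5.0, 1.2, 6.3, 3.4, 2.2],
})

x = np.array(data[["total_bill", "sex", "smoker", "day", "time", "size"]])
y = np.array(data["tip"])

model = LinearRegression()
model.fit(x, y)

# features = [[total_bill, sex, smoker, day, time, size]]
features = np.array([[24.50, 1, 0, 0, 1, 4]])
prediction = model.predict(features)
print(prediction)
```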

Check out thecleverprogrammer for the full code and explanation.

Back to TOC

Music Recommendation System

See codespeedy for a step-by-step guide.

Covid-19 Deaths Prediction

Governments and other legislative bodies rely on these kinds of machine learning predictive models and ideas to suggest new policies and assess the effectiveness of applied policies.

Download dataset 1

Download dataset 2

Machine Learning project Workflow:

1. Import the libraries and dataset

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px

# the package is published as "prophet" since v1.0; older installs use "fbprophet"
from prophet import Prophet
from sklearn.metrics import r2_score

plt.style.use("ggplot")

df0 = pd.read_csv("CONVENIENT_global_confirmed_cases.csv")
df1 = pd.read_csv("CONVENIENT_global_deaths.csv")

2. Data preparation

Combine the two datasets above and visualize the result to see what you are working with.

3. Data Visualization

fig = px.choropleth(world.dropna(), locations="Alpha3", color="Cases Range", projection="mercator", color_discrete_sequence=["white","khaki","yellow","orange","red"])
fig.update_geos(fitbounds="locations", visible=False)
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

4. Prediction for the next 30 days

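Use the Facebook Prophet model here. The snippet assumes df_fb is a DataFrame with the columns ds (date) and y (value) that Prophet expects.

model = Prophet()
model.fit(df_fb)
# extend the dates 30 days past the end of the data and predict history + future
future = model.make_future_dataframe(periods=30, freq="D")
df_forecast = model.predict(future)
# R2 on the historical part, where the actual values are known
print(r2_score(df_fb["y"], df_forecast["yhat"][:len(df_fb)]))

forecast = df_forecast[["ds","yhat_lower","yhat_upper","yhat"]].tail(30).reset_index().set_index("ds").drop("index",axis=1)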
forecast["yhat"].plot(marker=".",figsize=(10,5))
plt.fill_between(x=forecast.index, y1=forecast["yhat_lower"], y2=forecast["yhat_upper"],color="gray")
plt.legend(["forecast","Bound"],loc="upper left")
plt.title("Forecasting of Next 30 Days Cases")
plt.show()

Check thecleverprogrammer for the full code and explanation.

Stress Detection

Stress, anxiety, and depression are threatening the mental health of people. Every person has a reason for having a stressful life. Many content creators have come forward to create content to help people with their mental health. Many organizations can use stress detection to find which social media users are stressed to help them quickly.

Download the dataset

Machine Learning project Workflow:

1. Import the libraries and dataset

import pandas as pd
import numpy as np
data = pd.read_csv("stress.csv")
print(data.head())

2. Visualize the dataset

import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
text = " ".join(i for i in data.text)
stopwords = set(STOPWORDS)
wordcloud = WordCloud(stopwords=stopwords, 
                      background_color="white").generate(text)
plt.figure( figsize=(15,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

3. Building the model

The label column in this dataset contains labels as 0 and 1. 0 means no stress, and 1 means stress.

data["label"] = data["label"].map({0: "No Stress", 1: "Stress"})
data = data[["text", "label"]]
print(data.head())

4. Splitting the dataset

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

x = np.array(data["text"])
y = np.array(data["label"])

cv = CountVectorizer()
X = cv.fit_transform(x)
xtrain, xtest, ytrain, ytest = train_test_split(X, y, 
                                                test_size=0.33, 
                                                random_state=42)

5. Training the model

from sklearn.naive_bayes import BernoulliNB
model = BernoulliNB()
model.fit(xtrain, ytrain)

6. Testing the model

user = input("Enter a Text: ")
data = cv.transform([user]).toarray()
output = model.predict(data)
print(output)
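The steps above can also be tried without the dataset or interactive input. The sketch below runs the same CountVectorizer and BernoulliNB pipeline on a tiny invented corpus, so its output only illustrates the mechanics.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Tiny invented corpus standing in for stress.csv.
texts = [
    "I feel anxious and overwhelmed",
    "I am so stressed and anxious about work",
    "I cannot sleep because I am worried and stressed",
    "Everything feels overwhelming and I am worried",
    "I had a relaxing walk in the park",
    "Enjoyed a calm and happy weekend",
    "Feeling relaxed and happy today",
    "A calm morning with coffee in the park",
]
labels = ["Stress"] * 4 + ["No Stress"] * 4

# Bag-of-words features, then a Bernoulli Naive Bayes classifier.
cv = CountVectorizer()
X = cv.fit_transform(texts)
model = BernoulliNB()
model.fit(X, labels)

# Classify a new sentence using the already fitted vectorizer.
user = "I am stressed and worried"
prediction = model.predict(cv.transform([user]))[0]
print(prediction)
```

Note that new text must go through cv.transform, not fit_transform, so that it is mapped onto the vocabulary learned during training.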

Check thecleverprogrammer for the full code and explanation

Back to TOC

Helpful Links

Happy coding!
