Intro
I have completed my Capstone project, and this post is a summary of it. I decided to build a new recommendation system for video games at Amazon.com, for the following reasons:
- I play lots of video games.
- I shop at Amazon.com.
- 35% of Amazon's revenue reportedly comes from its recommendation system (link).
- I don't see a recommendation system at Amazon.com that specifically uses customers' ratings.
There are already various recommendations based on "items bought together", "browsing history", "similar products", and so on. Still, more than half of the items recommended to me at Amazon.com are not relevant to my interests, and many are similar to items I already own and won't buy again.
It gets even worse on the video game department's page. Amazon is busy showing top-selling or top-rated items, new releases, and items with special discounts; the page does not feel personalized at all.
The page might get more attention if Amazon added a section with a personalized list of recommended games.
Outline
I plan to write three posts outlined below. This post focuses on the first part.
- Introduction and building a recommendation list using customer ratings
- Proving customers would have a different shopping experience
- Trimming down a long list of recommended games and wrapping up
Data
Amazon has released a dataset with two parts: review data and metadata. If the links to the two separate datasets below do not work, please use this link.
Review Data
The data can be downloaded here.
This dataset has four columns: item id (video game id), user id, rating, and timestamp. We do not need the timestamp column for this analysis.
Here are the first 5 rows of the data.
Here are the counts for each rating and the overall distribution. Ratings range from 1 to 5, and the majority (58%) of ratings are 5.
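For reference, the review data can be loaded with pandas roughly as sketched below; the file name is only a placeholder for the downloaded CSV, and the column names are labels I chose to match the description above.
import pandas as pd
# placeholder path to the downloaded review file; adjust to your own copy
df = pd.read_csv('Video_Games.csv',
                 names=['item_id', 'user_id', 'rating', 'timestamp'])
df = df.drop(columns='timestamp')   # timestamp is not needed for this analysis
print(df.head())                                   # first 5 rows
print(df['rating'].value_counts(normalize=True))   # share of each rating (1-5)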
Metadata
The data can be downloaded here.
The metadata contains various information about the video games, but we only need the item id, console, and company. Unfortunately, it does not include genre information.
This data will be crucial for the second and third parts, and I will show how it is cleaned in the second part. For the current part, we only need the review data.
Analysis - Building a System
Train-test split
The data is split for validation purposes.
Python has a convenient library called "surprise", designed specifically for recommender systems. Surprise has its own train-test split function, but its data format is awkward to work with, so I used both scikit-learn and Surprise to split and format the data.
# selecting X and y
X = df[df.columns[:2]]    # item_id, user_id
y = df[df.columns[-1]]    # rating

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

from surprise import Reader, Dataset
# read in values as Surprise datasets
reader = Reader()

# train data, loaded again for the Surprise library
train = Dataset.load_from_df(pd.concat([X_train, y_train], axis=1), reader)

# test data, loaded again for the Surprise library
test = Dataset.load_from_df(pd.concat([X_test, y_test], axis=1), reader)

# whole data for comparison
data = Dataset.load_from_df(df, reader)
Collaborative Filtering
The Surprise library uses matrix factorization to predict ratings on items. This is a supervised learning technique because the rating information is treated as a continuous target variable.
The table above shows 5 items and 4 users as an example. Users do not rate all items, and the system predicts the missing ratings. For example, user 4 would have 4 missing ratings filled in with predicted values. The games with the highest predicted ratings are then recommended to the user.
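As a toy illustration (the numbers below are made up, not taken from the real data), a small 4-user by 5-item rating matrix with missing entries could look like this:
import numpy as np
import pandas as pd
# toy 4-user x 5-item rating matrix; NaN marks games a user has not rated
toy = pd.DataFrame(
    [[5, np.nan, 3, np.nan, 1],
     [np.nan, 4, np.nan, 2, 5],
     [4, np.nan, np.nan, 5, np.nan],
     [np.nan, np.nan, 2, np.nan, np.nan]],
    index=['user 1', 'user 2', 'user 3', 'user 4'],
    columns=['game A', 'game B', 'game C', 'game D', 'game E'])
# user 4 has rated only game C, so the model predicts the other 4 ratings
# and the games with the highest predicted values are recommended
print(toy)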
The code below calculates RMSE values that can be compared with those from various scikit-learn regressors; the Surprise library gives the best RMSE.
from surprise.model_selection import cross_validate
from surprise.prediction_algorithms import SVD
import numpy as np

svd = SVD(random_state=0)
val_svd = cross_validate(svd, train, measures=['RMSE', 'MAE'], cv=3)
print("Mean RMSE for the baseline model validation:")
print(np.mean(val_svd['test_rmse']))

# fit on the full training set, then predict each (item, user) pair in the test set
train_set = train.build_full_trainset()
svd = SVD(random_state=0).fit(train_set)
pred = []
for i in range(len(X_test)):
    # index 3 of the Prediction tuple is the estimated rating
    pred.append(svd.predict(X_test.iloc[i].values[0], X_test.iloc[i].values[1])[3])

from sklearn.metrics import mean_squared_error
print("RMSE for the baseline model on test data:")
print(mean_squared_error(y_test, pred, squared=False))
- Mean RMSE for the baseline model validation: 1.2980
- RMSE for the baseline model on test data: 1.2827
The RMSE value is not so great. A tuned model shows better RMSE values.
# tuned hyperparameters; the model is refit and evaluated the same way as the baseline
svd = SVD(random_state=0,
          n_factors=100,
          reg_all=0.07,
          n_epochs=150)
- Mean RMSE for the tuned model validation: 1.2850
- RMSE for the tuned model on test data: 1.2660
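Hyperparameters like the ones above can be searched with Surprise's GridSearchCV; the sketch below uses an illustrative parameter grid rather than the exact values considered for this project.
from surprise.model_selection import GridSearchCV
# illustrative grid; adjust the ranges to your compute budget
param_grid = {'n_factors': [50, 100],
              'reg_all': [0.02, 0.07],
              'n_epochs': [50, 150]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=3)
gs.fit(train)
print(gs.best_score['rmse'])    # best validation RMSE found
print(gs.best_params['rmse'])   # hyperparameters that achieved it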
RMSE does not improve significantly. I was hoping for a value of less than 1, but this is the best I have. It could be improved with more data cleaning; for example, I could try to find and remove outliers and/or customers with very few ratings. I might add a fourth part with that update.
The graph below shows an example of the video games recommended to a selected customer, along with their predicted ratings.
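A ranked list like that can be produced along these lines; the customer below is just an arbitrary example, and the column names match the ones assumed when loading the data.
# sketch: rank unrated games for one customer by predicted rating
customer_id = X_test['user_id'].iloc[0]   # arbitrary example customer
already_rated = set(df.loc[df['user_id'] == customer_id, 'item_id'])
candidates = [g for g in df['item_id'].unique() if g not in already_rated]
# predict with the same (item_id, user_id) argument order used when loading the data
predicted = [(g, svd.predict(g, customer_id).est) for g in candidates]
predicted.sort(key=lambda t: t[1], reverse=True)
print(predicted[:10])                     # top 10 games by predicted rating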
It is easy to see that this customer would be overwhelmed by a huge list of recommended video games. I will talk about how this can be improved, but it is worth first discussing whether this system (the tuned model) is actually useful.