In Part-1 of this tutorial series we briefly discussed Amazon Redshift and dove deep into Amazon Redshift ML. We also learnt how a database engineer/administrator can use Redshift ML to create, train and deploy a machine learning model using familiar SQL commands.

Continuing from where we left off, we are now going to look at some of the advanced functionality of Amazon Redshift ML that a Data Analyst or an expert Data Scientist can take advantage of. It offers more flexibility in defining specific information, such as which algorithm to use (for example, XGBoost), which hyperparameters to set, which preprocessors to apply, and so on.
Exercise 2 (Data Analyst's perspective)
Dataset
In this problem, we are going to use the Steel Plates Faults Data Set from the UCI Machine Learning Repository. You can download the dataset from this GitHub Repo.
This dataset is related to the quality of steel plates. It has 27 independent variables (input features) describing various attributes of a steel plate, and one dependent variable (class label) that can take one of 7 values. So the problem at hand is a multi-class classification problem, where we need to predict the fault in a steel plate, given the 7 different types of fault it can have.

So, the objective is to predict which fault the steel plate has (Pastry, Z_Scratch, K_Scatch, Stains, Dirtiness, Bumps or Other_Faults).
As we saw in Part-1, since our dataset is located in Amazon S3, we first need to load the data into a table. We can open DataGrip (or whichever SQL client you are using) and create the schema and the table. Once that is done, we can use the COPY command to load the training data from Amazon S3 (steel_fault_train.csv) into the Redshift cluster, into the table steel_plates_fault.

As always, we need to make sure that the column names of the table match the feature set in the CSV training dataset file.

Similarly, we can load the test dataset (steel_fault_test.csv) into a separate table, steel_plates_fault_inference.
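A minimal sketch of the load step is shown below, assuming the two tables have already been created with the 27 feature columns plus a class-label column matching the CSV header. The bucket name and IAM role ARN are placeholders, and IGNOREHEADER 1 assumes the files include a header row (drop it if they don't).

```sql
-- Sketch: load the training and test CSV files from Amazon S3 into Redshift.
-- <your-bucket> and the IAM role ARN are placeholders.
COPY steel_plates_fault
FROM 's3://<your-bucket>/steel_fault_train.csv'
IAM_ROLE '<your-redshift-iam-role-arn>'
CSV
IGNOREHEADER 1;

COPY steel_plates_fault_inference
FROM 's3://<your-bucket>/steel_fault_test.csv'
IAM_ROLE '<your-redshift-iam-role-arn>'
CSV
IGNOREHEADER 1;
```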
Training (Model Creation)
Now, being a data analyst, you may like to explicitly specify a few of the parameters, such as the PROBLEM_TYPE and the OBJECTIVE function. When you provide this information while creating the model, Amazon SageMaker Autopilot uses the PROBLEM_TYPE and OBJECTIVE you specified, instead of trying everything.

For this problem, we are going to provide the PROBLEM_TYPE as multiclass_classification and the OBJECTIVE as accuracy.
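Here is a minimal sketch of such a CREATE MODEL statement. The target column name (class) is an assumption, and the IAM role ARN and S3 bucket are placeholders to be replaced with your own values.

```sql
-- Sketch: CREATE MODEL with an explicit problem type and objective.
-- 'class' is the assumed name of the label column in steel_plates_fault.
CREATE MODEL model_steel_fault
FROM steel_plates_fault
TARGET class
FUNCTION func_model_steel_fault
IAM_ROLE '<your-redshift-ml-iam-role-arn>'
PROBLEM_TYPE MULTICLASS_CLASSIFICATION
OBJECTIVE 'accuracy'
SETTINGS (S3_BUCKET '<your-s3-bucket>');
```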
Other PROBLEM_TYPE values we can specify are:
- REGRESSION
- BINARY_CLASSIFICATION
- MULTICLASS_CLASSIFICATION
Similarly, the OBJECTIVE function could be:
- MSE
- Accuracy
- F1
- F1Macro
- AUC
As we learnt in Part-1 of the tutorial, the CREATE MODEL command operates in asynchronous mode and returns a response as soon as the training data has been exported to Amazon S3. Since the remaining steps of model training and compilation can take much longer, they continue to run in the background.

We can always check the status of the training using the STV_ML_MODEL_INFO system table, and wait until the model_state becomes Model is Ready.
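For example, a simple status check could look like this (the model name follows the CREATE MODEL sketch above):

```sql
-- Check the training status; repeat until model_state reports the model is ready.
SELECT model_name, model_state
FROM stv_ml_model_info
WHERE model_name = 'model_steel_fault';
```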
Now, let's look at the details of the model and see whether it has used the same PROBLEM_TYPE and OBJECTIVE function that we specified while executing the CREATE MODEL command.
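The SHOW MODEL command prints these details, including the problem type, objective, function name and model status:

```sql
SHOW MODEL model_steel_fault;
```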
Accuracy of the Model and Prediction/Inference
Lastly, let's check the accuracy of our model using the test data we have in the steel_plates_fault_inference table.
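One way to do this is to compare the predicted label with the actual one across the whole test table. The query below is a sketch that assumes the table uses the original UCI feature column names and a class column for the label; adjust the column list to match your own schema.

```sql
-- Sketch: overall accuracy on the test table.
-- Feature column names are assumed to follow the UCI dataset; 'class' is the assumed label column.
SELECT ROUND(100.0 * SUM(CASE WHEN class = predicted_class THEN 1 ELSE 0 END) / COUNT(*), 2)
       AS accuracy_pct
FROM (
    SELECT class,
           func_model_steel_fault(x_minimum, x_maximum, y_minimum, y_maximum,
               pixels_areas, x_perimeter, y_perimeter, sum_of_luminosity,
               minimum_of_luminosity, maximum_of_luminosity, length_of_conveyer,
               typeofsteel_a300, typeofsteel_a400, steel_plate_thickness,
               edges_index, empty_index, square_index, outside_x_index,
               edges_x_index, edges_y_index, outside_global_index,
               logofareas, log_x_index, log_y_index, orientation_index,
               luminosity_index, sigmoidofareas) AS predicted_class
    FROM steel_plates_fault_inference
) t;
```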
As we can see, the accuracy is around 77%, which is not all that great, but that is because we used a very small dataset to train the model behind func_model_steel_fault.
And finally, let's try to do some prediction using this same model function.
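For example, we can group the test rows by the predicted fault type; the same assumed column names from the accuracy query apply here.

```sql
-- Sketch: count how many test plates are predicted for each fault type.
SELECT func_model_steel_fault(x_minimum, x_maximum, y_minimum, y_maximum,
           pixels_areas, x_perimeter, y_perimeter, sum_of_luminosity,
           minimum_of_luminosity, maximum_of_luminosity, length_of_conveyer,
           typeofsteel_a300, typeofsteel_a400, steel_plate_thickness,
           edges_index, empty_index, square_index, outside_x_index,
           edges_x_index, edges_y_index, outside_global_index,
           logofareas, log_x_index, log_y_index, orientation_index,
           luminosity_index, sigmoidofareas) AS predicted_fault,
       COUNT(*) AS num_plates
FROM steel_plates_fault_inference
GROUP BY 1
ORDER BY num_plates DESC;
```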
Lastly, let's take another example, this time from a Data Scientist's perspective, wherein we will make use of some more advanced options while executing the CREATE MODEL command.
Exercise 3 (Data Scientist's perspective)
The last two problems we worked on were classification problems (binary and multi-class); this time we will work on a regression problem and use some more advanced parameters while training the model (such as specifying the training algorithm, hyperparameters, etc.).
Dataset
In this problem, we are going to use the Abalone Data Set from the UCI Machine Learning Repository. You can download the dataset from this GitHub Repo.
In this problem we need to predict the age of an abalone from its physical measurements. The age of an abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope -- a boring and time-consuming task.

The dataset has a total of 7 input features and 1 target, which is nothing but the age.
So, first let's create the schema and the table. Once that is done, we can use the COPY command to load the training data from Amazon S3 (xgboost_abalone_train.csv) into the Redshift cluster, into the table abalone_xgb_train.

Similarly, we can load the test dataset (xgboost_abalone_test.csv) from Amazon S3 into a separate table, abalone_xgb_test.
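A sketch of the table definition and the load step is shown below. The column names (length_val, diameter, and so on) are assumptions based on the standard abalone attributes, and the bucket and IAM role values are placeholders; add IGNOREHEADER 1 to the COPY command if your files include a header row.

```sql
-- Sketch: assumed table layout (7 physical measurements + rings as the target).
CREATE TABLE abalone_xgb_train (
    length_val     float,
    diameter       float,
    height         float,
    whole_weight   float,
    shucked_weight float,
    viscera_weight float,
    shell_weight   float,
    rings          int
);

-- Load the training data from S3 (bucket and IAM role are placeholders).
COPY abalone_xgb_train
FROM 's3://<your-bucket>/xgboost_abalone_train.csv'
IAM_ROLE '<your-redshift-iam-role-arn>'
CSV;

-- The test table abalone_xgb_test is created and loaded the same way
-- from xgboost_abalone_test.csv.
```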
Training (Model Creation)
As a data scientist, you may like to have more control over training the model; for example, you may decide to provide more granular options such as MODEL_TYPE, OBJECTIVE, PREPROCESSORS and HYPERPARAMETERS while running the CREATE MODEL command.

As an advanced user, you may already know the model type you want and the hyperparameters to use when training it. You can use the CREATE MODEL command with AUTO OFF to turn off the automatic discovery of preprocessors and hyperparameters.
For this problem we are going to specify MODEL_TYPE as xgboost (Extreme Gradient Boosting), which can be used for both regression and classification problems. XGBoost is currently the only MODEL_TYPE supported when AUTO is set to OFF. We are also going to use reg:squarederror as the OBJECTIVE function. You can specify hyperparameters as well; for more details, you may like to check the Amazon Redshift ML Developer Guide (CREATE MODEL section).
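A sketch of such a CREATE MODEL statement is shown below. The column names follow the table sketch above, the model name and the NUM_ROUND hyperparameter are illustrative choices, and the IAM role and S3 bucket are placeholders.

```sql
-- Sketch: AUTO OFF training with an explicit algorithm, objective and hyperparameters.
CREATE MODEL model_abalone_xgboost_regression
FROM (SELECT length_val, diameter, height, whole_weight,
             shucked_weight, viscera_weight, shell_weight, rings
      FROM abalone_xgb_train)
TARGET rings
FUNCTION func_model_abalone_xgboost_regression
IAM_ROLE '<your-redshift-ml-iam-role-arn>'
AUTO OFF
MODEL_TYPE XGBOOST
OBJECTIVE 'reg:squarederror'
PREPROCESSORS 'none'
HYPERPARAMETERS DEFAULT EXCEPT (NUM_ROUND '100')
SETTINGS (S3_BUCKET '<your-s3-bucket>');
```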
Now, let's look at the details about the model, as we did before:
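Again, SHOW MODEL displays the model metadata, which should now reflect the explicit settings we passed:

```sql
SHOW MODEL model_abalone_xgboost_regression;
```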
Accuracy of the Model and Prediction/Inference
Now, let's try to see how accurate our model is, using the test data we have in the abalone_xgb_test table.
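Since this is a regression model, a simple check is the mean squared error between the actual and the predicted number of rings. The sketch below reuses the assumed column names from the earlier table definition.

```sql
-- Sketch: mean squared error of the predictions on the test table.
SELECT AVG(POWER(rings - predicted_rings, 2)) AS mse
FROM (
    SELECT rings,
           func_model_abalone_xgboost_regression(length_val, diameter, height,
               whole_weight, shucked_weight, viscera_weight, shell_weight) AS predicted_rings
    FROM abalone_xgb_test
) t;
```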
And finally, let's try to do some prediction using this same model function, func_model_abalone_xgboost_regression.
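For example, to compare a handful of individual predictions with the actual ring counts:

```sql
-- Sketch: predicted vs. actual rings for a few test rows.
SELECT rings AS actual_rings,
       func_model_abalone_xgboost_regression(length_val, diameter, height,
           whole_weight, shucked_weight, viscera_weight, shell_weight) AS predicted_rings
FROM abalone_xgb_test
LIMIT 10;
```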
What next...
So, in this tutorial we learnt about Amazon Redshift ML from an advanced user's perspective (that of a Data Analyst or a Data Scientist), and saw how we can create, train and deploy an ML model using familiar SQL queries. If you are interested, you can go further and explore the training jobs that Redshift ML internally initiates in the Amazon SageMaker console. Feel free to give it a try and share your feedback.
Resources
- Code : GitHub repo
- Blog : Amazon Redshift ML - Machine Learning in SQL Style (Part-1)
- Documentation:
- Book:
- Videos: