Building Machine Learning Apps Faster With dstack.ai
(Originally Published at q-viper.github.io)
Happy New Year everyone!!!!
This is the second part of my dstack blog. This part is more exciting and awesome than the previous one because pushing and pulling ML models is very handy when it comes to sharing our models with co-workers or a trusted person.
If you are new to dstack then I request you to view my previous blog or the documentation.
According to dstack’s documentation:
"dstack decouples the development of applications from the development of ML models by offering an ML registry. This way, one can develop ML models, push them to the registry, and then later pull these models from applications."
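To make the push/pull idea concrete before touching dstack itself, here is a tiny sketch of what a registry does conceptually. Everything in it is an assumption made for illustration: `ToyRegistry`, the `toy://` URL scheme, and the dictionary standing in for a model are all hypothetical and are not how dstack is implemented.

```python
import pickle


class ToyRegistry:
    """Hypothetical in-memory stand-in for a model registry (not dstack's API)."""

    def __init__(self):
        self._store = {}

    def push(self, name, model):
        # Serialize the object so whoever pulls it gets an independent copy.
        self._store[name] = pickle.dumps(model)
        return f"toy://{name}"

    def pull(self, name):
        # Deserialize a fresh copy of whatever was pushed under this name.
        return pickle.loads(self._store[name])


# "Training" side: any picklable object can stand in for a model here.
registry = ToyRegistry()
model = {"threshold": 0.5}
url = registry.push("titanic/toy_model", model)

# "Application" side: it only needs the name, not the training code.
pulled = registry.pull("titanic/toy_model")
```

The point of the sketch is the decoupling: the code that pulls never imports or runs the code that trained.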
In the first part of this series, we pushed a visualization of the Titanic Survival Dataset. In this part, I will train 3 classifiers to predict whether a passenger will survive or not.
Project Structure
This time our project will be a little more organized than the previous one because we need 2 Python scripts: one for pushing the ML models and another for pulling them.
- Root Folder
  - Data
    - titanic_data.csv
  - titanic_push.py
  - titanic_pull.py
File titanic_push.py
As usual, we start by importing dependencies. In this same file, we will train 3 classifier models and push them to our Model Registry.
- Decision Tree
- Random Forest
- Gradient Boosting
I am following this blog for training a Model.
import dstack.controls as ctrl
import dstack as ds
import pandas as pd
import matplotlib.pyplot as plt
import random
import numpy as np
from sklearn import datasets, svm, tree, preprocessing, metrics
import sklearn.ensemble as ske
from sklearn.model_selection import train_test_split, ShuffleSplit, cross_val_score
@ds.cache()
def get_data():
    filename = "F:/Desktop/learning/dstack/blog/data/titanic_data.csv"
    return pd.read_csv(filename)
df = get_data()
df = df.drop(['Cabin'], axis=1)
df = df.dropna()
def preprocess_df(df):
    processed_df = df.copy()
    le = preprocessing.LabelEncoder()
    processed_df.Sex = le.fit_transform(processed_df.Sex)
    processed_df.Embarked = le.fit_transform(processed_df.Embarked)
    processed_df = processed_df.drop(['Name','Ticket'],axis=1)
    return processed_df
processed_df = preprocess_df(df)
X = processed_df.drop(['Survived'], axis=1).values
y = processed_df['Survived'].values
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)
clf_dt = tree.DecisionTreeClassifier(max_depth=10)
clf_dt.fit(X_train, y_train)
# clf_dt.score (X_test, y_test)
url = ds.push("titanic/decision_tree", clf_dt)
print("Decision tree ", url)
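# In sklearn.model_selection, ShuffleSplit's first argument is n_splits
# (the number of random re-splits), not the number of samples.
shuffle_validator = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
def test_classifier(clf):
    scores = cross_val_score(clf, X, y, cv=shuffle_validator)
    return scores.mean()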
print(f"Decision Tree Acc: {test_classifier(clf_dt)}\n")
clf_rf = ske.RandomForestClassifier(n_estimators=50)
clf_rf.fit(X_train, y_train)
url = ds.push("titanic/random_forest", clf_rf)
print("Random Forest ", url)
print(f"Random Forest Acc: {test_classifier(clf_rf)}\n")
clf_gb = ske.GradientBoostingClassifier(n_estimators=50)
clf_gb.fit(X_train, y_train)
url = ds.push("titanic/gradient_boosting", clf_gb)
print("Gradient Boosting ", url)
print(f"Gradient Boosting Acc: {test_classifier(clf_gb)}\n")
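The `test_classifier` helper in the script relies on `ShuffleSplit` and `cross_val_score`. Below is a small sketch of how those two fit together. It runs on synthetic data from `make_classification` rather than the Titanic CSV, and the sizes (200 samples, 5 splits) are values I picked only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Titanic dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# n_splits is the number of random re-splits; test_size is the held-out share.
splitter = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

# Each split holds out 20% of the rows for scoring.
train_idx, test_idx = next(splitter.split(X_demo))

# cross_val_score refits a clone of the classifier on every split
# and returns one accuracy score per split.
clf = DecisionTreeClassifier(max_depth=10, random_state=0)
scores = cross_val_score(clf, X_demo, y_demo, cv=splitter)
mean_accuracy = scores.mean()
```

Because `cross_val_score` refits a clone on each split, the reported accuracy does not depend on the earlier `fit` call on `X_train`.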
- We start by importing dependencies.
- Define a function that reads the CSV file from local storage. Cache it, because we might call it frequently.
- Drop the Cabin column and any rows with NULL data, then drop the non-numeric Name and Ticket columns.
- Preprocess the data a little bit (label-encode Sex and Embarked) to make it trainable.
- Split the data into a training set and a test set.
- Train a Decision Tree, push it to titanic/decision_tree, and print its URL.
- Train a Random Forest, push it to titanic/random_forest, and print its URL.
- Train a Gradient Boosting classifier, push it to titanic/gradient_boosting, and print its URL.
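The preprocessing steps above can be tried on a tiny hand-made table. This is only a sketch: the four rows below are made up and merely reuse the column names from the Titanic CSV, and `preprocess` is a condensed, hypothetical variant of `preprocess_df` that uses a fresh `LabelEncoder` per column.

```python
import pandas as pd
from sklearn import preprocessing

# Made-up sample rows; only the column names match the real dataset.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", None, "Q"],
    "Cabin": [None, "C85", None, None],
    "Name": ["A", "B", "C", "D"],
    "Ticket": ["1", "2", "3", "4"],
})


def preprocess(df):
    # Drop the mostly empty Cabin column first, then rows with missing values.
    out = df.drop(["Cabin"], axis=1).dropna().copy()
    # LabelEncoder maps each distinct string to an integer (sorted order).
    out["Sex"] = preprocessing.LabelEncoder().fit_transform(out["Sex"])
    out["Embarked"] = preprocessing.LabelEncoder().fit_transform(out["Embarked"])
    # Name and Ticket are free text, so they are removed before training.
    return out.drop(["Name", "Ticket"], axis=1)


processed = preprocess(sample)
```

Dropping Cabin before calling `dropna` matters: most Cabin values are missing, so doing it the other way around would throw away most rows.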
Something like the below will be shown on the terminal.
After opening any of the URLs and then going to the ML Models tab in the left navigation panel, we can see something like below:
I already had some models before making this blog so there are more than 3 models.
We have successfully pushed our models, and if we open titanic/gradient_boosting, we will see something like below.
dstack provides a wonderful way to document our model by letting us write a README file. We can describe the performance of our model or its use cases there. I find this feature very useful because I can write about the properties of my model in plain text.
File titanic_pull.py
We have pushed our models to the registry, so now it is time to pull them from a different location. I am not going to pull them from a remote machine, just from a different Python file.
import dstack.controls as ctrl
import dstack as ds
import pandas as pd
import matplotlib.pyplot as plt
import random
import numpy as np
from sklearn import datasets, svm, tree, preprocessing, metrics
import sklearn.ensemble as ske
from sklearn.model_selection import train_test_split, ShuffleSplit, cross_val_score
@ds.cache()
def get_data():
    filename = "F:/Desktop/learning/dstack/blog/data/titanic_data.csv"
    return pd.read_csv(filename)
df = get_data()
titanic_df = df.copy()
titanic_df = titanic_df.drop(['Cabin'], axis=1)
titanic_df = titanic_df.dropna()
def preprocess_titanic_df(df):
    processed_df = df.copy()
    le = preprocessing.LabelEncoder()
    processed_df.Sex = le.fit_transform(processed_df.Sex)
    processed_df.Embarked = le.fit_transform(processed_df.Embarked)
    processed_df = processed_df.drop(['Name','Ticket'],axis=1)
    return processed_df
processed_df = preprocess_titanic_df(titanic_df)
X = processed_df.drop(['Survived'], axis=1).values
y = processed_df['Survived'].values
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)
def get_decision_tree():
    return ds.pull("titanic/decision_tree")
def get_random_forest():
    return ds.pull("titanic/random_forest")
def get_gradient_boosting():
    return ds.pull("titanic/gradient_boosting")
def dt_pred():
    dt = get_decision_tree()
    p = dt.predict(X)
    pdf = processed_df.copy()
    pdf["DT Pred"] = p
    return pdf
def rf_pred():
    rf = get_random_forest()
    p = rf.predict(X)
    pdf = processed_df.copy()
    pdf["RF Pred"] = p
    return pdf
def gb_pred():
    gb = get_gradient_boosting()
    p = gb.predict(X)
    pdf = processed_df.copy()
    pdf["GB Pred"] = p
    return pdf
dt_app = ds.app(dt_pred)
rf_app = ds.app(rf_pred)
gb_app = ds.app(gb_pred)
url = ds.push("titanic/dt_pred", dt_app)
print(f"Decision Tree: {url}\n")
url = ds.push("titanic/rf_pred", rf_app)
print(f"Random Forest: {url}\n")
url = ds.push("titanic/gb_pred", gb_app)
print(f"Gradient Boosting: {url}\n")
What is happening above?
- Same as the pushing code, our pulling code starts by importing dependencies.
- Read the data from local storage and preprocess it, because we will use this dataset to get predictions from our models.
- Make a function to pull each model and return it.
- Make a function that runs prediction with the pulled model, stores the prediction in a new column of the data frame, and returns that data frame.
- Make a data app for each of these functions (decision tree, random forest, and gradient boosting).
- Push each application and print its URL.
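The "predict and add a column" step can also be tried without a registry. The sketch below is an illustration under assumptions: the data is synthetic, the feature names `f0` to `f3` are made up, and the two models are trained in place as stand-ins for models that would otherwise be pulled. `add_predictions` is a hypothetical helper, not part of dstack.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Titanic dataset).
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)
frame = pd.DataFrame(X_demo, columns=["f0", "f1", "f2", "f3"])

# Locally trained stand-ins for models that would come from the registry.
models = {
    "DT Pred": DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo),
    "RF Pred": RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo),
}


def add_predictions(df, features, fitted_models):
    # Work on a copy so the original frame stays untouched.
    out = df.copy()
    for column, model in fitted_models.items():
        out[column] = model.predict(features)
    return out


result = add_predictions(frame, X_demo, models)
```

In titanic_pull.py the fitted models come from `ds.pull` instead of being trained in place, and each prediction function returns its own data frame rather than sharing one.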
If everything is right, we get a URL for each app. After going to titanic/dt_pred, something like below should be shown:
Finally
I have just trained and then pushed/pulled some simple classifiers for the Titanic Survival Dataset, and also added their predictions as a new column. If we want to share a model with anyone, we can simply go to share and choose whether we want it to be public or not. In the next part, I will write about training a TensorFlow model and then reusing it. Also, I have not figured out all the cool UI tools that dstack provides, so in the next part, I will try to use them and make a cooler project.
If you reached this line, please leave a comment so that I can improve. Also, if you have any queries, ping me on LinkedIn as Ramkrishna Acharya.