About our dataset
Cardiovascular diseases (CVDs) are the number one cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide.
Heart failure is a common event caused by CVDs, and this dataset contains 12 features that can be used to predict mortality from heart failure.
Most cardiovascular diseases can be prevented by addressing behavioural risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity, and harmful use of alcohol through population-wide strategies.
Data Source https://www.kaggle.com/andrewmvd/heart-failure-clinical-data
Task
Create a model for predicting mortality caused by Heart Failure.
12 clinical features for predicting death events.
Our Machine Learning WorkFlow
I feel a lot more comfortable defining my workflow for solving a machine learning problem before I ever start solving the problem itself, as that gives me a sense of direction. However, this may be different for other people ✍🏻
Below are the steps we are going to take to solve this machine learning problem:
- Problem Definition and Data Collection
- Get the data ready for use (Data Preprocessing)
  - Check for missing values
  - Fill missing values, if any
  - Turn categorical features into numerical ones
- Feature Engineering: the process of using domain knowledge to extract features (characteristics, properties, attributes) from raw data. A feature is a property shared by independent units on which analysis or prediction is to be done. Features are used by predictive models and influence results.
- Modelling
- Make Predictions
- Evaluate model performance metrics
- See if we need to improve our model
- Export our trained model
- Load our trained model
1. Problem Definition
People with cardiovascular disease, or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia or already established disease), need early detection and management, and this is where a machine learning model can be of great help.
Data Collection
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
df = pd.read_csv('Datasets/heart_failure_clinical_records_dataset.csv')
df.head()
 | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 75.0 | 0 | 582.0 | 0.0 | 20.0 | 1 | 265000 | 1.9 | 130.0 | 1.0 | 0.0 | 4.0 | 1 |
1 | 55.0 | 0 | 7861.0 | 0.0 | 38.0 | 0 | 263358.03 | 1.1 | 136.0 | 1.0 | 0.0 | 6.0 | 1 |
2 | 65.0 | 0 | 146.0 | NaN | NaN | 0 | 162000 | 1.3 | 129.0 | 1.0 | 1.0 | 7.0 | 1 |
3 | 50.0 | 1 | 111.0 | 0.0 | 20.0 | 0 | 210000 | 1.9 | 137.0 | 1.0 | 0.0 | 7.0 | 1 |
4 | 65.0 | 1 | NaN | 1.0 | 20.0 | 0 | 327000 | 2.7 | 116.0 | 0.0 | 0.0 | 8.0 | 1 |
# Let's get a summary of our data set
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 298 non-null float64
1 anaemia 299 non-null int64
2 creatinine_phosphokinase 292 non-null float64
3 diabetes 294 non-null float64
4 ejection_fraction 294 non-null float64
5 high_blood_pressure 296 non-null object
6 platelets 288 non-null object
7 serum_creatinine 299 non-null float64
8 serum_sodium 294 non-null float64
9 sex 298 non-null float64
10 smoking 298 non-null float64
11 time 297 non-null float64
12 DEATH_EVENT 299 non-null int64
dtypes: float64(9), int64(2), object(2)
memory usage: 30.5+ KB
# Check for duplicates
df.duplicated().sum()
0
# Let's check how correlated each feature is to another
correlation = df.corr()
correlation
 | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT
---|---|---|---|---|---|---|---|---|---|---|---|
age | 1.000000 | 0.091852 | -0.088883 | -0.096539 | 0.061344 | 0.155650 | -0.052726 | 0.059484 | 0.018416 | -0.219635 | 0.252176 |
anaemia | 0.091852 | 1.000000 | -0.184689 | -0.015766 | 0.026149 | 0.054518 | 0.060957 | -0.099176 | -0.109526 | -0.133430 | 0.066270 |
creatinine_phosphokinase | -0.088883 | -0.184689 | 1.000000 | -0.022312 | -0.054024 | -0.012482 | 0.057265 | 0.074000 | 0.005451 | -0.025688 | 0.073627 |
diabetes | -0.096539 | -0.015766 | -0.022312 | 1.000000 | -0.015566 | -0.043697 | -0.092627 | -0.154765 | -0.135563 | 0.028746 | 0.001362 |
ejection_fraction | 0.061344 | 0.026149 | -0.054024 | -0.015566 | 1.000000 | -0.010466 | 0.199457 | -0.146827 | -0.057282 | 0.028818 | -0.261605 |
serum_creatinine | 0.155650 | 0.054518 | -0.012482 | -0.043697 | -0.010466 | 1.000000 | -0.181161 | 0.010253 | -0.026469 | -0.147806 | 0.290386 |
serum_sodium | -0.052726 | 0.060957 | 0.057265 | -0.092627 | 0.199457 | -0.181161 | 1.000000 | -0.042738 | 0.001440 | 0.057033 | -0.175385 |
sex | 0.059484 | -0.099176 | 0.074000 | -0.154765 | -0.146827 | 0.010253 | -0.042738 | 1.000000 | 0.446947 | -0.005693 | -0.007482 |
smoking | 0.018416 | -0.109526 | 0.005451 | -0.135563 | -0.057282 | -0.026469 | 0.001440 | 0.446947 | 1.000000 | -0.019119 | -0.014233 |
time | -0.219635 | -0.133430 | -0.025688 | 0.028746 | 0.028818 | -0.147806 | 0.057033 | -0.005693 | -0.019119 | 1.000000 | -0.522918 |
DEATH_EVENT | 0.252176 | 0.066270 | 0.073627 | 0.001362 | -0.261605 | 0.290386 | -0.175385 | -0.007482 | -0.014233 | -0.522918 | 1.000000 |
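Notice that high_blood_pressure and platelets are missing from the correlation matrix: df.corr() only uses numeric columns, and those two are currently stored with the object dtype (as df.info() showed above). A quick sketch to confirm which columns were left out:
# Which columns did df.corr() skip because they are not numeric? (sketch)
non_numeric = df.columns.difference(df.select_dtypes(include='number').columns)
print(list(non_numeric))  # expected: ['high_blood_pressure', 'platelets']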
df.describe()
 | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT
---|---|---|---|---|---|---|---|---|---|---|---|
count | 298.000000 | 299.000000 | 292.000000 | 294.000000 | 294.000000 | 299.000000 | 294.000000 | 298.000000 | 298.000000 | 297.000000 | 299.00000 |
mean | 60.870248 | 0.431438 | 577.688356 | 0.418367 | 38.227891 | 1.391104 | 136.697279 | 0.651007 | 0.322148 | 130.208754 | 0.32107 |
std | 11.898166 | 0.496107 | 972.468942 | 0.494132 | 11.852295 | 1.034449 | 4.338123 | 0.477454 | 0.468085 | 77.365687 | 0.46767 |
min | 40.000000 | 0.000000 | 23.000000 | 0.000000 | 14.000000 | 0.500000 | 113.000000 | 0.000000 | 0.000000 | 4.000000 | 0.00000 |
25% | 51.000000 | 0.000000 | 115.000000 | 0.000000 | 30.000000 | 0.900000 | 134.000000 | 0.000000 | 0.000000 | 73.000000 | 0.00000 |
50% | 60.000000 | 0.000000 | 249.500000 | 0.000000 | 38.000000 | 1.100000 | 137.000000 | 1.000000 | 0.000000 | 115.000000 | 0.00000 |
75% | 70.000000 | 1.000000 | 582.000000 | 1.000000 | 45.000000 | 1.400000 | 140.000000 | 1.000000 | 1.000000 | 201.000000 | 1.00000 |
max | 95.000000 | 1.000000 | 7861.000000 | 1.000000 | 80.000000 | 9.400000 | 148.000000 | 1.000000 | 1.000000 | 285.000000 | 1.00000 |
Definitions of the terms from the DataFrame .describe() method above
- Count: the number of non-missing values in a particular feature/column
- Mean: the average value (the sum of the values divided by their count)
- Std: the standard deviation (the square root of the variance, i.e. how much values differ from the mean)
- Min: the minimum value in each feature or column
- Max: the maximum value in each feature or column
- Percentiles: the cut-off points that split the data into equal segments (25%, 50%, 75%); e.g. 25% of the values fall below the 25% figure
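As a quick sanity check on these definitions, we can reproduce a few of the .describe() numbers by hand:
# Reproduce a few describe() values manually (sketch)
print(df['age'].mean())          # should match the 'mean' row for age (about 60.87)
print(df['age'].std())           # the 'std' row: sample standard deviation
print(df['age'].quantile(0.25))  # the '25%' row: a quarter of patients are younger than this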
# Check number of columns and rows
df.shape # This shows that our dataset contains 299 rows and 13 columns
(299, 13)
# Let's check out the column names
j = 0
for i in df.columns:
    j += 1
    print(j, i.upper())
1 AGE
2 ANAEMIA
3 CREATININE_PHOSPHOKINASE
4 DIABETES
5 EJECTION_FRACTION
6 HIGH_BLOOD_PRESSURE
7 PLATELETS
8 SERUM_CREATININE
9 SERUM_SODIUM
10 SEX
11 SMOKING
12 TIME
13 DEATH_EVENT
# Distribution of patient ages
df['age'].hist(bins=50);
# Let's Visualize our correlations better
# sns.set(font_scale=1)
fig, ax = plt.subplots(figsize=(15,10))
ax = sns.heatmap(correlation, annot=True, fmt='.2f', cmap='YlGnBu', linewidths=.05);
pd.crosstab(df.age, df.DEATH_EVENT)
age | DEATH_EVENT = 0 | DEATH_EVENT = 1
---|---|---
40.000 | 7 | 0 |
41.000 | 1 | 0 |
42.000 | 6 | 1 |
43.000 | 1 | 0 |
44.000 | 2 | 0 |
45.000 | 13 | 6 |
46.000 | 2 | 1 |
47.000 | 1 | 0 |
48.000 | 0 | 2 |
49.000 | 3 | 1 |
50.000 | 18 | 8 |
51.000 | 3 | 1 |
52.000 | 5 | 0 |
53.000 | 9 | 1 |
54.000 | 1 | 1 |
55.000 | 14 | 3 |
56.000 | 1 | 0 |
57.000 | 1 | 1 |
58.000 | 8 | 2 |
59.000 | 1 | 3 |
60.000 | 20 | 13 |
60.667 | 1 | 1 |
61.000 | 4 | 0 |
62.000 | 4 | 1 |
63.000 | 8 | 0 |
64.000 | 3 | 0 |
65.000 | 18 | 8 |
66.000 | 2 | 0 |
67.000 | 2 | 0 |
68.000 | 3 | 2 |
69.000 | 1 | 2 |
70.000 | 18 | 7 |
72.000 | 2 | 5 |
73.000 | 3 | 1 |
75.000 | 5 | 6 |
77.000 | 1 | 1 |
78.000 | 2 | 0 |
79.000 | 1 | 0 |
80.000 | 2 | 5 |
81.000 | 1 | 0 |
82.000 | 0 | 3 |
85.000 | 3 | 3 |
86.000 | 0 | 1 |
87.000 | 0 | 1 |
90.000 | 1 | 2 |
94.000 | 0 | 1 |
95.000 | 0 | 2 |
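A table this long is easier to read as a plot. One possible sketch, stacking the survival and death counts per age:
# Visualize the crosstab as a stacked bar chart (sketch)
pd.crosstab(df.age, df.DEATH_EVENT).plot(kind='bar', stacked=True, figsize=(15, 5))
plt.xlabel('Age')
plt.ylabel('Number of patients')
plt.legend(['Survived (0)', 'Died (1)'])
plt.show()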
2. Data Preprocessing
Check if there are missing values
df.isna()
 | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | False | False | False | False | False | False | False | False |
1 | False | False | False | False | False | False | False | False | False | False | False | False | False |
2 | False | False | False | True | True | False | False | False | False | False | False | False | False |
3 | False | False | False | False | False | False | False | False | False | False | False | False | False |
4 | False | False | True | False | False | False | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
294 | False | False | False | False | False | False | True | False | False | False | False | False | False |
295 | False | False | False | False | False | False | False | False | False | False | False | False | False |
296 | False | False | False | False | False | False | True | False | False | False | False | False | False |
297 | False | False | False | False | False | False | True | False | False | False | False | False | False |
298 | False | False | False | False | False | False | True | False | False | False | False | False | False |
299 rows × 13 columns
df.isna().sum()
age 1
anaemia 0
creatinine_phosphokinase 7
diabetes 5
ejection_fraction 5
high_blood_pressure 3
platelets 11
serum_creatinine 0
serum_sodium 5
sex 1
smoking 1
time 2
DEATH_EVENT 0
dtype: int64
✍🏻 Now, with the help of the pandas .isna() method, we are able to find out which columns have missing values in them.
Fill the missing values
There are several ways to handle missing values during data exploration and feature engineering (each option is sketched briefly after this list):
- Fill them with the mean, mode or median of their parent column
- Remove the samples (rows) that contain missing data
- Use an unsupervised learning approach to predict and fill in the missing values
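A minimal sketch of what each option can look like with pandas and scikit-learn (the columns and settings here are only for illustration):
# Option 1: fill missing values with a summary statistic of each column
df_filled = df.fillna(df.median(numeric_only=True))
# Option 2: drop every row that still contains a missing value
df_dropped = df.dropna()
# Option 3: a learning-based imputer that estimates missing values
# from the rows that look most similar (nearest neighbours)
from sklearn.impute import KNNImputer
numeric_part = df.select_dtypes(include='number')
filled_array = KNNImputer(n_neighbors=5).fit_transform(numeric_part)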
list(df.columns)
['age',
'anaemia',
'creatinine_phosphokinase',
'diabetes',
'ejection_fraction',
'high_blood_pressure',
'platelets',
'serum_creatinine',
'serum_sodium',
'sex',
'smoking',
'time',
'DEATH_EVENT']
Split our data into Features and Label (Independent and Dependent Variables)
# Split the data into features and labels
X = df.drop('DEATH_EVENT', axis=1)
y = df['DEATH_EVENT']
# We can use the SimpleImputer class in scikit-learn (via a ColumnTransformer) to do this
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
# Let's Define the columns to fill
categoricals = ['platelets', 'high_blood_pressure']
numerical = ['age',
'anaemia',
'creatinine_phosphokinase',
'diabetes',
'ejection_fraction',
'serum_creatinine',
'serum_sodium',
'sex',
'smoking',
'time'] # Removed the two 'categorical' features (high_blood_pressure and platelets)
# Define the way SimpleImputer will fill the missing values
categorical_imputer = SimpleImputer(strategy='constant', fill_value='None')
numerical_imputer = SimpleImputer(strategy='mean')
# Create the Imputer
transformer = ColumnTransformer([('categoricals', categorical_imputer, categoricals),
('numerical', numerical_imputer, numerical)])
new_X = transformer.fit_transform(X)
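Note that fit_transform returns a plain NumPy array, and the ColumnTransformer re-orders the columns to match the order of its transformers (the two 'categorical' columns first, then the numerical ones). If you want to keep working with labelled data, a small sketch:
# Put the transformed array back into a labelled DataFrame (sketch)
new_X_df = pd.DataFrame(new_X, columns=categoricals + numerical)
new_X_df.head()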
Congratulations!! 😎 💃
We've successfully filled and converted all our data to numbers
4. Modelling
Based on our problem and data, what machine learning model should we use? Let the audience decide 😉
Okay, you got it right 🤣. This is a classification problem, because we are predicting whether our output is one thing or another, i.e. whether it's 1 or 0, True or False, rice or beans, etc. In other words, this is a binary classification problem.
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
# Instantiate our model classes
random_forest_model, ada_boost = RandomForestClassifier(), AdaBoostClassifier()
# Split our data into train and test splits
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(new_X, y, train_size=.8, random_state=21)
# Let's check out the shape of our data to understand how it has been split
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((239, 12), (60, 12), (239,), (60,))
Now let's train our machine learning model to find patterns in our data.
random_forest_model.fit(X_train, y_train)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-306-7b70e76ee335> in <module>
----> 1 random_forest_model.fit(X_train, y_train)
/opt/anaconda3/lib/python3.8/site-packages/sklearn/ensemble/_forest.py in fit(self, X, y, sample_weight)
302 "sparse multilabel-indicator for y is not supported."
303 )
--> 304 X, y = self._validate_data(X, y, multi_output=True,
305 accept_sparse="csc", dtype=DTYPE)
306 if sample_weight is not None:
/opt/anaconda3/lib/python3.8/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
431 y = check_array(y, **check_y_params)
432 else:
--> 433 X, y = check_X_y(X, y, **check_params)
434 out = X, y
435
/opt/anaconda3/lib/python3.8/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
61 extra_args = len(args) - len(all_args)
62 if extra_args <= 0:
---> 63 return f(*args, **kwargs)
64
65 # extra_args > 0
/opt/anaconda3/lib/python3.8/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
812 raise ValueError("y cannot be None")
813
--> 814 X = check_array(X, accept_sparse=accept_sparse,
815 accept_large_sparse=accept_large_sparse,
816 dtype=dtype, order=order, copy=copy,
/opt/anaconda3/lib/python3.8/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
61 extra_args = len(args) - len(all_args)
62 if extra_args <= 0:
---> 63 return f(*args, **kwargs)
64
65 # extra_args > 0
/opt/anaconda3/lib/python3.8/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
614 array = array.astype(dtype, casting="unsafe", copy=False)
615 else:
--> 616 array = np.asarray(array, order=order, dtype=dtype)
617 except ComplexWarning as complex_warning:
618 raise ValueError("Complex data not supported\n"
/opt/anaconda3/lib/python3.8/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order, like)
100 return _asarray_with_like(a, dtype=dtype, order=order, like=like)
101
--> 102 return array(a, dtype, copy=False, order=order)
103
104
ValueError: could not convert string to float: 'None'
Now, why are we getting an error? What did we do wrong? What can we do about it?
# Let's check out the unique values in our X and y sets
y.unique(), new_X[::-1]
(array([1, 0]),
array([['None', '0', 50.0, ..., 1.0, 1.0, 285.0],
['None', '0', 45.0, ..., 1.0, 1.0, 280.0],
['None', '0', 45.0, ..., 0.0, 0.0, 278.0],
...,
['162000', '0', 65.0, ..., 1.0, 1.0, 7.0],
['263358.03', '0', 55.0, ..., 1.0, 0.0, 6.0],
['265000', '1', 75.0, ..., 1.0, 0.0, 4.0]], dtype=object))
Because our new_X array still contains strings rather than numbers, our machine learning model refuses to work with the data and raises a ValueError: the SimpleImputer we set up for the 'categorical' columns filled their missing entries with the string 'None', and the original values in those object-dtype columns (platelets and high_blood_pressure) were never converted to numbers.
def fix_object_data(X):
    """
    Fix the string (object) columns in our dataset by converting them to numeric.
    Any value that cannot be converted (such as the string 'None') becomes NaN.
    X: DataFrame
    """
    for label, content in X.items():
        if pd.api.types.is_string_dtype(content):
            X[label] = pd.to_numeric(X[label], errors='coerce')
    return X
Let's remind ourselves of the data types involved in our dataset.
X.dtypes
age float64
anaemia int64
creatinine_phosphokinase float64
diabetes float64
ejection_fraction float64
high_blood_pressure object
platelets object
serum_creatinine float64
serum_sodium float64
sex float64
smoking float64
time float64
dtype: object
fix_object_data(X).dtypes
age float64
anaemia int64
creatinine_phosphokinase float64
diabetes float64
ejection_fraction float64
high_blood_pressure float64
platelets float64
serum_creatinine float64
serum_sodium float64
sex float64
smoking float64
time float64
dtype: object
X = fix_object_data(X)
One thing I recommend doing before filling missing values is visualizing the frequency of the data to look for outliers, so you know which averaging method is better suited for filling the missing values (for heavily skewed data the median is usually safer than the mean).
def value_counts(data: pd.DataFrame, key='age'):
    """Return the value counts of one column (key) as a DataFrame."""
    columns = data.columns
    val_count = {}
    # Loop through our data and count the occurrences of each value per column
    for col in columns:
        val_count[col] = data[col].value_counts()
    return pd.DataFrame(val_count[key])
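For example, before choosing between the mean and the median for platelets, we can take a quick look at its spread (a sketch using the helper above):
# Inspect the most common platelet counts and the overall distribution (sketch)
value_counts(X, key='platelets').head(10)
X['platelets'].hist(bins=50);  # if this looks heavily skewed, the median is often the safer fill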
# Now let's consider filling these values one after the other, since our features are not many
X['age'] = X['age'].fillna(X['age'].mean())
X['creatinine_phosphokinase'].fillna(X['creatinine_phosphokinase'].median(), inplace=True)
X['diabetes'].fillna(X['diabetes'].median(), inplace=True)
X['ejection_fraction'].fillna(X['ejection_fraction'].mean(), inplace=True)
X['high_blood_pressure'].fillna(X['high_blood_pressure'].mean(), inplace=True)
X['platelets'].fillna(X['platelets'].mean(), inplace=True)
X['serum_sodium'].fillna(X['serum_sodium'].median(), inplace=True)
X['sex'].fillna(X['sex'].median(), inplace=True)
X['smoking'].fillna(X['smoking'].median(), inplace=True)
X['time'].fillna(X['time'].median(), inplace=True)
X.isna().sum()
age 0
anaemia 0
creatinine_phosphokinase 0
diabetes 0
ejection_fraction 0
high_blood_pressure 0
platelets 0
serum_creatinine 0
serum_sodium 0
sex 0
smoking 0
time 0
dtype: int64
Now that we've fixed our data, let's split it and fit our model for training.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size=.8, random_state=21)
# Now let's train our model
model = RandomForestClassifier(n_estimators=120,max_depth=10)
model.fit(X_train, y_train)
RandomForestClassifier(max_depth=10, n_estimators=120)
5. Make predictions using our model
model.predict(X_test)
array([1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0,
0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0])
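If we want more than the hard 0/1 labels, the model can also tell us how confident it is about each prediction:
# Predicted class probabilities instead of hard labels (sketch)
# Each row is [P(DEATH_EVENT = 0), P(DEATH_EVENT = 1)] for one test patient
model.predict_proba(X_test)[:5]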
# Have a look at what our X_test looks like
X_test.head()
 | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time
---|---|---|---|---|---|---|---|---|---|---|---|---|
61 | 50.0 | 0 | 318.0 | 0.0 | 40.0 | 1.0 | 216000.000000 | 2.3 | 131.0 | 0.0 | 0.0 | 60.0 |
297 | 45.0 | 0 | 2413.0 | 0.0 | 38.0 | 0.0 | 262269.408112 | 1.4 | 140.0 | 1.0 | 1.0 | 280.0 |
55 | 95.0 | 1 | 371.0 | 0.0 | 30.0 | 0.0 | 461000.000000 | 2.0 | 132.0 | 1.0 | 0.0 | 50.0 |
243 | 73.0 | 1 | 1185.0 | 0.0 | 40.0 | 1.0 | 220000.000000 | 0.9 | 141.0 | 0.0 | 0.0 | 213.0 |
95 | 58.0 | 1 | 133.0 | 0.0 | 60.0 | 1.0 | 219000.000000 | 1.0 | 141.0 | 1.0 | 0.0 | 83.0 |
6. Evaluate our model Performance
There are different classification evaluation metrics available to evaluate our model's performance; however, which metric you use depends heavily on the problem you are solving.
They include the following:
- Accuracy: the default metric for classification problems; not the best choice for imbalanced classes
- Precision: higher precision means fewer false positives
- Recall: higher recall means fewer false negatives
- F1 Score: usually a good overall metric for a classification model
- Confusion Matrix: a breakdown of correct and incorrect predictions for each class
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)
0.9
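Accuracy alone can hide how the model handles the minority class (only about 32% of patients have DEATH_EVENT = 1), so it is worth checking the other metrics from the list above as well. A sketch using the metrics already imported plus a few more from sklearn.metrics:
from sklearn.metrics import recall_score, f1_score, classification_report
print(confusion_matrix(y_test, y_pred))       # rows = actual class, columns = predicted class
print(precision_score(y_test, y_pred))        # higher precision -> fewer false positives
print(recall_score(y_test, y_pred))           # higher recall -> fewer false negatives
print(f1_score(y_test, y_pred))               # balance between precision and recall
print(classification_report(y_test, y_pred))  # all of the above, per class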
8. Export our model
There are 2 ways we can save our model
- Using the pickle module
- Using the joblib module
1. Using the pickle module
import pickle as pkl
# Export with pickle
pkl.dump(model, open('your_first_model_pkl.pkl','wb')) #wb -> write binary
# Load with pickle
loaded_model_pkl = pkl.load(open('your_first_model_pkl.pkl', 'rb')) #rb -> read binary
# Make prediction
loaded_model_pkl.predict(X_test)
array([1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0,
0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0])
2. Using the Joblib module
import joblib as jbl
# Export with joblib
jbl.dump(model, 'your_first_model_pkl.joblib')
# Load with joblib
loaded_model_jbl = jbl.load('your_first_model_pkl.joblib')
# Make prediction
loaded_model_jbl.predict(X_test)
array([1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0,
0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0])
Top comments (4)
Uhm, quick question bro. Talking about feature engineering, are there any instances where we would have to use only some specific features? You seemed to use most of them, and how do we know whether the features we used are too many or too few for our model to generalize well?
Hi Samuel, yes, in most cases you may need to perform what we call "Principal Component Analysis" or, more generally, dimensionality reduction. However, in my experience I mostly do this when I have around 30 or more feature variables.
You can also run
model.feature_importances_
which will return a score for each variable showing how much it contributes to your model's predictions; that way you know which variables are useful and which are not.
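To make those scores easier to read, you can pair them with the column names, for example (a small sketch):
# Pair each feature with its importance score and sort from most to least important (sketch)
pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
Hope this helps?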
It does, thanks 😊