Welcome! This post is a quick explanation of how I built a face mask detector using ResNet50 as a feature extractor and a stacking ensemble of a Support Vector Machine (SVM) and a decision tree as the classifier.
As a tribute to fellow researchers, this app is based on the research paper "A hybrid deep transfer learning model with machine learning methods for face mask detection in the era of the COVID-19 pandemic" by Mohamed Loey et al.
Table of contents:
- Dataset Retrieval
- Preprocessing
- Feature Extraction
- Split Dataset
- Define Model Classifier
- Tuning Model
- Create Final Model
- Deploy Real App
Dataset Retrieval
This application uses a dataset from Kaggle. The dataset contains 853 images belonging to 3 classes, along with their bounding boxes in the PASCAL VOC format. The classes are with_mask, without_mask, and mask_weared_incorrect. For this project, I only use the with_mask and without_mask labels. Check out the image sample below.
You can access the dataset via the URL below.
https://www.kaggle.com/datasets/andrewmvd/face-mask-detection
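If you use the Kaggle CLI, kaggle datasets download -d andrewmvd/face-mask-detection fetches it as a zip archive. Here is a minimal sketch to unpack it into the folder layout used below (the zip file name is the CLI default and an assumption on my part):

import zipfile

# Unpack the downloaded archive into ./face-mask-detection
# (expects images/ and annotations/ subfolders inside)
with zipfile.ZipFile("face-mask-detection.zip") as zf:
    zf.extractall("./face-mask-detection")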
Preprocessing
Preprocessing starts with cropping the face areas based on the bounding box information. First, read all the XML and image files from the dataset folder.
import os

img_names = []
xml_names = []

for dirname, _, filenames in os.walk('./face-mask-detection'):
    for filename in filenames:
        if os.path.join(dirname, filename)[-3:] != "xml":
            img_names.append(filename)
        else:
            xml_names.append(filename)

print(len(img_names), "images")
Then crop each image to its bounding boxes and read the labels.
import xmltodict
from matplotlib import pyplot as plt
from skimage.io import imread

path_annotations = "face-mask-detection/annotations/"
path_images = "face-mask-detection/images/"

class_names = ['with_mask', 'without_mask']
images = []
target = []

# Crop the face region given a PASCAL VOC bounding box (xmin, ymin, xmax, ymax)
def crop_bounding_box(img, bnd):
    x1, y1, x2, y2 = list(map(int, bnd.values()))
    _img = img.copy()
    _img = _img[y1:y2, x1:x2]
    _img = _img[:, :, :3]  # keep RGB only, dropping any alpha channel
    return _img

for img_name in img_names:
    with open(path_annotations + img_name[:-4] + ".xml") as fd:
        doc = xmltodict.parse(fd.read())

    img = imread(path_images + img_name)
    temp = doc["annotation"]["object"]
    # xmltodict returns a list when an image has several annotated objects
    if type(temp) == list:
        for i in range(len(temp)):
            if temp[i]["name"] not in class_names:
                continue
            images.append(crop_bounding_box(img, temp[i]["bndbox"]))
            target.append(temp[i]["name"])
    else:
        if temp["name"] not in class_names:
            continue
        images.append(crop_bounding_box(img, temp["bndbox"]))
        target.append(temp["name"])
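To sanity-check the crops before going further, it helps to look at a few of them with their labels. Here is a minimal sketch reusing the matplotlib import above:

# Show the first few cropped faces with their labels
fig, axes = plt.subplots(1, 4, figsize=(12, 3))
for ax, image, label in zip(axes, images, target):
    ax.imshow(image)
    ax.set_title(label)
    ax.axis('off')
plt.show()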
Based on the labels, this dataset consists of 3,232 faces with a mask and 717 faces without a mask.
Preprocessing also includes resizing and normalization using the ImageNet statistics, since ResNet50 was pretrained on ImageNet.
import torch
from torchvision import transforms

# Define preprocessing
preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((128, 128)),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

# Apply preprocess
image_tensor = torch.stack([preprocess(image) for image in images])
image_tensor.shape
Feature Extraction
Feature extraction is needed to gather information from the images, using spatial operations to produce a representation of each label. In this application, I use ResNet50 as the feature extractor. The last layer of ResNet50, a fully connected layer with 1,000 neurons (one per ImageNet class), needs to be removed.
from torchvision import models
# Download model
resnet = models.resnet50(pretrained=True)
resnet = torch.nn.Sequential(*(list(resnet.children())[:-1]))
To freeze the convolutional part of ResNet50 and keep it fixed, I need to set requires_grad to False.
for param in resnet.parameters():
    param.requires_grad = False
I also need to call eval() to put ResNet50's batch normalization layers into inference mode. Otherwise, they would keep updating their running statistics, which would interfere with model accuracy; eval() makes sure ResNet50 acts purely as a fixed feature extractor.
resnet.eval()
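With the head removed and eval() set, a quick dummy forward pass confirms that the network now outputs a 2048-dimensional feature map instead of 1,000 class logits. This sanity check is my own addition:

# Sanity check: one fake RGB image at the 128x128 input size
with torch.no_grad():
    dummy = torch.randn(1, 3, 128, 128)
    print(resnet(dummy).shape)  # expected: torch.Size([1, 2048, 1, 1])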
The last step is to apply ResNet50 to extract the features. ResNet then returns a vector of 2048 features for each image.
import numpy as np

result = np.empty((len(image_tensor), 2048))
for i, data in enumerate(image_tensor):
    output = resnet(data.unsqueeze(0))
    output = torch.flatten(output, 1)
    result[i] = output[0].numpy()
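The loop above extracts features one image at a time. A batched variant under torch.no_grad() is usually faster and avoids building any autograd state; here is a sketch, assuming a batch size of 32:

# Optional: batched feature extraction
batch_size = 32
features = []
with torch.no_grad():
    for i in range(0, len(image_tensor), batch_size):
        batch = image_tensor[i:i + batch_size]
        feats = torch.flatten(resnet(batch), 1)  # shape: (batch, 2048)
        features.append(feats)
result = torch.cat(features).numpy()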
Split Dataset
To be able to detect overfitting, I split the data into 70% training data and 30% test data. The training data is used to fit the model, and the test data is used to validate it on images the model has never seen.
from sklearn.model_selection import train_test_split
X, y = result, np.array(target)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print("Training data\n", np.asarray(np.unique(y_train, return_counts=True)).T)
print("Test data\n", np.asarray(np.unique(y_test, return_counts=True)).T)
Define Model Classifier
As I teased before, the proposed model is a stacking classifier (an ensemble method) that uses an SVM and a decision tree as weak learners, with logistic regression as the final estimator. In short, ensemble methods are techniques that create multiple models and then combine them to produce improved results; an ensemble usually produces more accurate predictions than a single model would.
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

clf = StackingClassifier(
    estimators=[('svm', SVC(random_state=42)),
                ('tree', DecisionTreeClassifier(random_state=42))],
    final_estimator=LogisticRegression(random_state=42),
    n_jobs=-1)
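Under the hood, StackingClassifier trains the SVM and the tree on cross-validated folds (5-fold by default) and feeds their out-of-fold predictions to the logistic regression. Before tuning, a quick baseline fit shows where the untuned stack stands; this extra check is my own addition:

# Baseline: fit the untuned stack and score it on the held-out test set
clf.fit(X_train, y_train)
print('Baseline accuracy: %.3f' % clf.score(X_test, y_test))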
Tuning Model
Tuning is the process of maximizing a model's performance without overfitting or creating too high a variance. In machine learning, this is accomplished by selecting appropriate hyperparameters. You can define whatever tuning method you want; here is mine, a grid search over a small range of candidate values.
from sklearn.model_selection import GridSearchCV

param_grid = {
    'svm__C': [1.6, 1.7, 1.8],
    'svm__kernel': ['rbf'],
    'tree__criterion': ['entropy'],
    'tree__max_depth': [9, 10, 11],
    'final_estimator__C': [1.3, 1.4, 1.5]
}

grid = GridSearchCV(
    estimator=clf,
    param_grid=param_grid,
    scoring='accuracy',
    n_jobs=-1)
grid.fit(X_train, y_train)

print('Best parameters: %s' % grid.best_params_)
print('Accuracy: %.2f' % grid.best_score_)
Based on the tuning process, the best hyperparameters are:
Best parameters: {'final_estimator__C': 1.3, 'svm__C': 1.6, 'svm__kernel': 'rbf', 'tree__criterion': 'entropy', 'tree__max_depth': 11}
Accuracy: 0.98
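A side note: since GridSearchCV refits the best configuration on the whole training set by default (refit=True), the tuned model is also available directly, without rebuilding it by hand as I do below:

# Equivalent shortcut to the manual rebuild in the next section
best_clf = grid.best_estimator_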
Create Final Model
Now I can create the final model with the best hyperparameters. I hope this model will not overfit.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

final_clf = StackingClassifier(
    estimators=[('svm', SVC(C=1.6, kernel='rbf', random_state=42)),
                ('tree', DecisionTreeClassifier(criterion='entropy', max_depth=11, random_state=42))],
    final_estimator=LogisticRegression(C=1.3, random_state=42),
    n_jobs=-1)
final_clf.fit(X_train, y_train)

y_pred = final_clf.predict(X_test)
print('Accuracy score : ', accuracy_score(y_test, y_pred))
print('Precision score : ', precision_score(y_test, y_pred, average='weighted'))
print('Recall score : ', recall_score(y_test, y_pred, average='weighted'))
print('F1 score : ', f1_score(y_test, y_pred, average='weighted'))
Then I test the model on the test data using accuracy, precision, recall, and F1 score. The results are:
Accuracy score : 0.9721518987341772
Precision score : 0.9719379890530496
Recall score : 0.9721518987341772
F1 score : 0.9717932606523529
Looks pretty good! Check out the confusion matrix below. If it looks biased, please comment 😁.
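If you want to reproduce the matrix yourself, here is a minimal sketch (assuming scikit-learn 1.0+ for ConfusionMatrixDisplay.from_predictions):

from sklearn.metrics import ConfusionMatrixDisplay

# Confusion matrix for the test predictions
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.show()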
Deploy Real App
This step is not required, but if you are interested in deploying, you must export the model first. Only the stacking classifier, which was trained above, needs to be saved, so that it can be loaded again in another program.
import pickle

pkl_filename = 'face_mask_detection.pkl'
with open(pkl_filename, 'wb') as file:
    pickle.dump(final_clf, file)
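Loading it back in another program is symmetric. Note that the serving side must also reproduce the preprocessing and ResNet50 feature extraction, since only the classifier is pickled:

# Load the trained classifier elsewhere
with open('face_mask_detection.pkl', 'rb') as file:
    loaded_clf = pickle.load(file)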
The deployment process itself is fairly simple, but first check out the diagram below.
The important thing to remember is that you need to implement your own face detection model, crop each detected face, and run it through the same preprocessing and feature extraction before the classifier. For my example program, check out my GitHub repository. A sketch of that flow follows.
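As an illustration only (not the exact code from my repository), here is a sketch of the inference flow using OpenCV's bundled Haar cascade for the face detection step; the cascade choice and the predict_mask helper are my own assumptions:

import cv2

# Hypothetical end-to-end inference: detect faces, crop, extract features, classify
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')

def predict_mask(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    results = []
    for (x, y, w, h) in faces:
        crop = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2RGB)
        tensor = preprocess(crop).unsqueeze(0)  # same transforms as training
        with torch.no_grad():
            feature = torch.flatten(resnet(tensor), 1).numpy()
        results.append(((x, y, w, h), loaded_clf.predict(feature)[0]))
    return results

In a real app, predict_mask would run per video frame, with the returned boxes and labels drawn back onto the frame.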