PART I: INTRODUCTION TO YOLO AND DATA FORMAT.
A detailed tutorial explaining the ABCs of the YOLO model, dataset preparation, and how to efficiently train the object detection algorithm YOLOv5 on a custom dataset.
This blog post expands on a presentation given at DevFest22 Nairobi.
Introduction.
In recent years, advances in computer vision and machine learning have led to the development of more advanced object detection systems that can detect objects in real-time from video feeds of surveillance cameras or any recording. One popular approach for this task is the YOLO (You Only Look Once) object detection algorithm.
The YOLO (You Only Look Once) object detection algorithm, initially proposed by Redmon et al. [https://arxiv.org/abs/1506.02640], is one of the most popular models for real-time object detection thanks to its fast and accurate performance. YOLO has proven to be a valuable tool in a wide range of applications.
In this tutorial we'll explore how the YOLO model works and how it can be used for real-time fire detection using the implementation from Ultralytics [https://github.com/ultralytics/yolov5]. We will use transfer learning from the P5 models (a family of models supported by Ultralytics that differ in architecture and parameter size) to train our own model, evaluate its performance and use it for inference.
This tutorial is designed for people with a theoretical background in object detection and computer vision who want to see a practical implementation. An easy-to-use notebook with the full code implementation is provided for easier follow-through.
The ABCs of the YOLO Model.
The general idea of the YOLO model is that it applies a single neural network to the full image. This network divides the image into regions and predicts bounding boxes and probabilities for each region, and these bounding boxes are weighted by the predicted probabilities. Because it looks at the whole image at test time, its predictions are informed by the global context of the image, and it makes predictions with a single network evaluation, unlike other systems that require several evaluations. It relies on three techniques for detection:
- Residual blocks (Creating an S * S grid from the image)
- Bounding box regression (predict the height, width, center, and class of objects)
- Intersection Over Union (IOU) - to check how predicted boxes overlap with the ground truth and select the best fit.
1. Residual Blocks
One of the key features of YOLO is its use of residual blocks to create an S * S grid from the input image. This grid is used to divide the image into a set of cells, each of which is responsible for predicting a fixed number of bounding boxes and class probabilities. The use of residual blocks allows YOLO to process the entire image in a single pass, making it well-suited for real-time object detection tasks.
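As a rough illustration of the grid idea (a simplification, not the actual Ultralytics code), the sketch below shows how the normalized center of a box is assigned to the cell of an S * S grid that becomes responsible for predicting it:

# A minimal sketch of the S * S grid assignment (illustrative only, not the Ultralytics implementation).
S = 7  # number of grid cells per side, assumed here for illustration

def responsible_cell(x_center, y_center, s=S):
    """Return the (column, row) of the grid cell that contains the normalized box center."""
    col = min(int(x_center * s), s - 1)  # clamp so a center of exactly 1.0 stays inside the grid
    row = min(int(y_center * s), s - 1)
    return col, row

# Example: a box centered at (0.563, 0.686) falls in cell (3, 4) of a 7 * 7 grid.
print(responsible_cell(0.563462, 0.686216))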
2. Bounding Box Regression
In order to predict the locations of objects in the input image, YOLO uses bounding box regression. This technique involves predicting the height, width, center, and class of objects in the image. During training, the YOLO model learns to adjust the bounding boxes to better fit the objects in the training data. This allows the model to be more accurate at predicting the locations of objects in new images.
Every bounding box in the image is described by the following attributes: width (bw), height (bh), class label (c) and bounding box center (bx, by). A single bbox regression predicts all of these for each object.
The model then uses these attributes to predict bounding boxes for each cell. The use of anchor boxes allows YOLO to predict the locations of objects in the input image: an anchor box is a predefined set of bounding box dimensions that serves as a reference for predicting the bounding boxes of the objects in the image.
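To make the anchor-box idea concrete, here is a hedged sketch of the classic YOLOv2/v3-style decoding, in which the network predicts offsets (tx, ty, tw, th) relative to a grid cell and an anchor box. YOLOv5 uses a slightly different parameterization, so treat this only as an illustration of the principle:

import math

def decode_box(tx, ty, tw, th, cell_col, cell_row, anchor_w, anchor_h, s):
    """Decode raw network outputs into a normalized (bx, by, bw, bh) box.
    Classic YOLOv2/v3-style formulation, shown only to illustrate how anchor boxes are used."""
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    bx = (cell_col + sigmoid(tx)) / s  # box center x, normalized to [0, 1]
    by = (cell_row + sigmoid(ty)) / s  # box center y, normalized to [0, 1]
    bw = anchor_w * math.exp(tw)       # box width, scaled from the anchor prior
    bh = anchor_h * math.exp(th)       # box height, scaled from the anchor prior
    return bx, by, bw, bh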
3. Intersection Over Union (IOU)
In addition to predicting bounding boxes, YOLO also uses intersection over union (IOU) to check how well the bounding boxes overlap with the ground truth boxes and select the best fit. IOU is calculated as the area of overlap between the predicted bounding box and the ground truth box, divided by the area of union between the two boxes. A high IOU score indicates a good overlap between the predicted and ground truth boxes, while a low IOU score indicates a poor overlap.
Each grid cell is responsible for predicting bounding boxes and their confidence scores. The IOU is equal to 1 when a predicted bounding box matches the ground truth box exactly, and this mechanism is used to discard predicted boxes that deviate too much from the real box.
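A minimal sketch of the IOU computation for two boxes given as (xmin, ymin, xmax, ymax), matching the definition above:

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    # corners of the intersection rectangle
    inter_xmin = max(box_a[0], box_b[0])
    inter_ymin = max(box_a[1], box_b[1])
    inter_xmax = min(box_a[2], box_b[2])
    inter_ymax = min(box_a[3], box_b[3])
    inter_area = max(0, inter_xmax - inter_xmin) * max(0, inter_ymax - inter_ymin)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union_area = area_a + area_b - inter_area
    return inter_area / union_area if union_area > 0 else 0.0

# A perfect match gives 1.0; disjoint boxes give 0.0.
print(iou((376, 335, 428, 389), (376, 335, 428, 389)))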
By using these three techniques, YOLO is able to accurately detect objects in images and make predictions in real-time. This makes it a powerful tool for a wide range of object detection tasks, including real-time fire detection, pedestrian tracking, and more.
Real-Time Fire Detection With YOLOv5
Now that we've covered the basic working techniques of the YOLO model, let's look at how it can be used for real-time fire detection.
One use case for YOLO in fire detection is the monitoring of surveillance cameras. By training a YOLOv5 model on a large dataset of images and videos of fires, it is possible to build a model that can detect fires in real-time video streams. Training a YOLOv5 model is straightforward; the bigger part of the work comes when the dataset is not in the required format, as in our case. YOLOv5 expects the labels (bounding box information) to be in a .txt file with the same name as the image.
For this demo, we walk through an end-to-end object detection project on a custom fire dataset, using the YOLOv5 implementation developed by Ultralytics. Check out the same implementation using the newer version (YOLOv7).
The Walkthrough.
1. Dataset Handling.
The dataset for YOLOv5 should be organized into a hierarchy of folders with the following structure:
├── data.yaml
└── base_dir
    ├── images
    │   ├── train
    │   └── validation
    └── labels
        ├── train
        └── validation
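The data.yaml file referenced above tells YOLOv5 where the image folders are and which classes exist. A minimal sketch for this layout, assuming a single 'fire' class and paths relative to where training is launched, could look like this:

# data.yaml -- paths are relative to where training is launched; adjust them to your setup
train: base_dir/images/train
val: base_dir/images/validation

nc: 1            # number of classes (assuming a single 'fire' class here)
names: ['fire']  # class names, indexed by class_id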
Each image file should be accompanied by a corresponding text file with the same name that contains the annotation information for the objects in the image. The annotation file should contain one line for each object in the image, with each line having the following format:
class_id x_center y_center width height
where class_id is the integer id of the object's class, x_center and y_center are the coordinates of the center of the object's bounding box, and width and height are the dimensions of the bounding box. All of these values except class_id must be normalized to lie between 0 and 1.
In our case, we have variables that let us draw an image's bounding boxes: (xmin, ymin) and (xmax, ymax) are two opposite corners of a bbox, and width and height are the dimensions of the image from which the annotations were extracted. A sample of how our data looks is as follows:
|   | file_id | img_name | xmax | ymax | xmin | ymin | width | height |
|---|---------|----------|------|------|------|------|-------|--------|
| 100 | WEBFire977 | WEBFire977.jpg | 428 | 389 | 376 | 335 | 1280 | |
| 101 | WEBFire977 | WEBFire977.jpg | 764 | 474 | 462 | 368 | 1280 | |
| 102 | WEBFire977 | WEBFire977.jpg | 1173 | 495 | 791 | 387 | 1280 | |
| 103 | WEBFire977 | WEBFire977.jpg | 1293 | 522 | 1211 | 460 | 1280 | |
Below is an example of the image with its annotation drawn.
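If you want to reproduce such a visualization, a minimal sketch using OpenCV could look like this (the image path and the box values are placeholders taken from the sample rows above):

import cv2

# hypothetical path to one of the sample images above
img = cv2.imread('base_dir/images/train/WEBFire977.jpg')

# one bounding box from the sample dataframe: top-left (xmin, ymin) to bottom-right (xmax, ymax)
xmin, ymin, xmax, ymax = 376, 335, 428, 389
cv2.rectangle(img, (xmin, ymin), (xmax, ymax), color=(0, 0, 255), thickness=2)
cv2.imwrite('WEBFire977_annotated.jpg', img)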
2. Labelling Format.
First, from the above dataframe, we need to extract the x and y centers, as well as the height and width, for each object. Each text file should contain one bounding-box (BBox) annotation per object in the image. The annotations are normalized to the image size, so they lie within the range of 0 to 1, and are represented in the following format:
<object-class-ID> <x_center> <y_center> <box_width> <box_height>
- If there is more than one object in the image, the content of the YOLO annotation text file might look like this:
0 0.563462 0.686216 0.462500 0.195205
7 0.880769 0.796447 0.041346 0.112586
2 0.880769 0.796447 0.041346 0.112586
0 0.564663 0.679366 0.463942 0.181507
0 0.566106 0.658390 0.469712 0.192637
1 0.565144 0.359803 0.118750 0.107449
Each value is separated by a space, and the information for each object is on its own line. Since the annotations need to be normalized, let's normalize them and extract the center and dimensions of each fire object. In normalization, x coordinates are divided by the image width and y coordinates by the image height. The center is the sum of the two x (or y) coordinates divided by 2, and the box width and height are obtained by subtracting xmin from xmax and ymin from ymax. Below is a code snippet for the same.
# first, normalize the bbox coordinates to lie between 0 and 1:
# x values are divided by the image width, y values by the image height
df['x_min'] = df.apply(lambda record: (record.xmin)/record.width, axis =1)
df['y_min'] = df.apply(lambda record: (record.ymin)/record.height, axis =1)
df['x_max'] = df.apply(lambda record: (record.xmax)/record.width, axis =1)
df['y_max'] = df.apply(lambda record: (record.ymax)/record.height, axis =1)
# extract the Mid point location
df['x_mid'] = df.apply(lambda record: (record.x_max+record.x_min)/2, axis =1)
df['y_mid'] = df.apply(lambda record: (record.y_max+record.y_min)/2, axis =1)
# Extract the height and width of the object
df['w'] = df.apply(lambda record: (record.x_max-record.x_min), axis =1)
df['h'] = df.apply(lambda record: (record.y_max-record.y_min), axis =1)
After applying the above transformations, and since a single image can have more than one object, we group the dataframe so that each unique file and all of its objects sit on a single row. That is, for each image we build a list of dictionaries, where each dictionary holds the annotation information of one object. For a single file, the list looks like this:
[
{'x_min': 0.4677716390423573, 'y_min': 0.3788, "x_max":0.12435,"y_max":0.234352, "x_mid":0.8829343, "y_mid":0.23435, "w":0.23, "h":0.1234},
{.....},
{.....},
..........
]
Below is a code snippet that builds this structure for each image.
from tqdm import tqdm
import pandas as pd

# a list to hold the information of each unique file; it makes the conversion to a dataframe easier
TRAIN = []

for img_id in tqdm(df['file_id'].unique()):
    # get all rows that belong to the current file id
    curr_df = df[df['file_id'] == img_id].reset_index(drop=True)
    # file-level information shared by all objects in the image
    base_details = dict(curr_df.loc[0][['file_id', 'img_name', 'width', 'height']])
    # a list to hold the bbox annotation information of every object in the image
    information = []
    # iterate through all records of the current file id and extract their annotation information
    for indx in range(curr_df.shape[0]):
        # get the object's information as a dict and add it to the information list above
        other_details = dict(curr_df.loc[indx][['x_min', 'y_min', 'x_max', 'y_max', 'x_mid', 'y_mid', 'w', 'h']])
        information.append(other_details)
    # append the information for the current file
    TRAIN.append([base_details['file_id'], base_details['img_name'], base_details['width'], base_details['height'], information])

# create a dataframe from the list built above
processed_df = pd.DataFrame(TRAIN, columns=['image_id', 'img_name', 'width', 'height', 'information'])
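With the objects grouped per image, here is a hedged sketch of how the YOLOv5 label files could then be written (one .txt file per image, using the format described earlier). The output folder base_dir/labels/train and the single class id 0 are assumptions for illustration:

import os

# hypothetical output folder; in practice this depends on the train/validation split
out_dir = 'base_dir/labels/train'
os.makedirs(out_dir, exist_ok=True)

for _, row in processed_df.iterrows():
    label_path = os.path.join(out_dir, os.path.splitext(row['img_name'])[0] + '.txt')
    lines = []
    for obj in row['information']:
        # class_id x_center y_center width height, everything except class_id normalized to [0, 1]
        lines.append(f"0 {obj['x_mid']} {obj['y_mid']} {obj['w']} {obj['h']}")
    with open(label_path, 'w') as f:
        f.write('\n'.join(lines))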
The next step after this will be splitting the dataset for both training and validation. This will follow in the next post.
Link to the Notebook GITHUB
3. Training and Validation (INCOMPLETE).
You can check the other part from this link: Link to Part II of the blog.