Imagine you have video footage captured by a fixed camera, and you want a video clip that follows an object on the screen. It could be a person walking back and forth on a stage, a moving car, or a tennis player making sudden movements. Following an object with a camera can be difficult, but it becomes much easier if you take a wider shot and then crop it around a specific object. Processing such a video manually, however, can consume a lot of time, and this is exactly the situation where machine learning can help. It is smart enough to find and track an object on the screen, while other software crops the video stream.
Both of these tasks - tracking an object in a video stream and image processing - are parts of computer vision technology, used in many applications and software systems.
I am going to present a simple application of these technologies with code written in the Python programming language. I use the state-of-the-art YOLOv8 machine learning model for the object detection task and the OpenCV library for video processing.
A final solution would require a lot of math for selecting the proper object, smoothing the movement, and so on, but the basic logic is as follows:
1. Open an input video stream (I use a video file, but it could be a screen or camera capture, etc.)
2. Open an output video stream (a video file in this example)
3. Read a frame from the input video stream as an image
4. Find the position of the object in the image
5. Crop the image around the object
6. Add the cropped image to the output video stream
7. Read the next frame from the input video stream
8. If the video isn't finished, proceed to step 4
9. Close the input video stream
10. Close the output video stream
This covers the video processing only. If you want to keep the original audio track, the best way is to extract the audio from the original video and combine it with the resulting video using the FFmpeg or MoviePy libraries.
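For example, here is a minimal sketch using MoviePy (the 1.x API; the file names match the script below, and test_video_final.mp4 is my placeholder for the combined result):

from moviepy.editor import VideoFileClip

original = VideoFileClip('test_video.mp4')             # the source clip that still has the audio
processed = VideoFileClip('test_video_processed.mp4')  # the cropped video produced by the script below
processed = processed.set_audio(original.audio)        # attach the original audio track
processed.write_videofile('test_video_final.mp4')      # save the combined result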
Next, I will show the appropriate Python code for each of these steps.
Initialization
But first, we need to install some libraries and initialize our machine learning model. I will use the YOLOv8 model here since it is the latest and best pretrained model for object detection. It can also classify and segment objects, and it can process different data sources, including camera streams, video files, and images. It is pretrained on the COCO dataset, but you can improve its performance by training it on a custom dataset. It comes in several sizes, from nano to xlarge. Generally speaking, a bigger model gives more precise results but works slower, so you should choose the one that best fits your needs. I use the small model here since, in my experience, it performs well enough without a significant decrease in speed.
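For reference, the pretrained detection weights follow a simple naming scheme, and the ultralytics package downloads them automatically on first use:

from ultralytics import YOLO

# pretrained detection models, smallest to largest:
# yolov8n.pt (nano), yolov8s.pt (small), yolov8m.pt (medium),
# yolov8l.pt (large), yolov8x.pt (xlarge)
model = YOLO("yolov8n.pt")  # e.g. the nano model, for the fastest inference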
First, we need to create a virtual environment so that all our software is isolated from other projects and environments. You can do this with Anaconda or plain Python. Anaconda is a very useful distribution for data science and machine learning; install it first (https://www.anaconda.com/products/distribution#Downloads), then create a virtual environment with the following console command:
conda create --name myenv
Then, you have to activate your virtual environment with the command:
conda activate myenv
The next command will install not only the YOLOv8 model but also all of its required dependencies:
pip install ultralytics
The next command installs the OpenCV library, which is used for video and image processing.
pip install opencv-python
Sometimes you need to compile OpenCV from source or adjust the package, but usually this basic approach works.
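To verify that both packages installed correctly, you can run a quick check (a minimal sketch; the exact output varies by system):

import cv2
import ultralytics

print(cv2.__version__)  # prints the installed OpenCV version
ultralytics.checks()    # prints environment info and verifies the ultralytics install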
That's all. We are ready to start writing our Python script for the computer vision task. Here is the initialization part of the script:
from ultralytics import YOLO  # import our machine learning model
import cv2  # import OpenCV
import math  # used later to measure distances between box centers

model = YOLO("yolov8s.pt")  # load a pretrained model
fileSource = 'test_video.mp4'  # the source file we will process
fileTarget = 'test_video_processed.mp4'  # the file path where the processed video will be saved
cropCoords = [100,100,500,500]  # coordinates of the initial cropping box; this box will follow our object
Open an input video stream.
I will also shrink the cropping box if it is bigger than the video frame.
vidCapture = cv2.VideoCapture(fileSource)
fps = vidCapture.get(cv2.CAP_PROP_FPS)
totalFrames = vidCapture.get(cv2.CAP_PROP_FRAME_COUNT)
width = int(vidCapture.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(vidCapture.get(cv2.CAP_PROP_FRAME_HEIGHT))
if not cropCoords:
    [box_left, box_top, box_right, box_bottom] = [0, 0, width, height]
else:
    [box_left, box_top, box_right, box_bottom] = cropCoords
if box_left < 0:
    box_left = 0
if box_top < 0:
    box_top = 0
if box_right > width:
    box_right = width
if box_bottom > height:
    box_bottom = height
lastCoords = [box_left, box_top, box_right, box_bottom]
lastBoxCoords = lastCoords
box_width = box_right - box_left
box_height = box_bottom - box_top
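One caveat: if the source path is wrong or the codec is unsupported, OpenCV does not raise an error on its own; read() will simply return False for every frame. A minimal guard I would add right after opening the stream (my addition, not part of the original script):

if not vidCapture.isOpened():
    raise IOError("Cannot open video source: " + fileSource)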
Open an output video stream.
I use the MPEG codec for the video file, the same FPS as the input video stream (so the playback speed stays the same, although you can adjust it), and the size of the cropping box as the video dimensions. You can also resize the processed image, but you must be consistent: every frame must have the same size as the output video stream size defined in the constructor.
outputWriter = cv2.VideoWriter(fileTarget, cv2.VideoWriter_fourcc(*'MPEG'), fps, (box_width, box_height))
Read a frame from the input video stream as an image. Read the next frame from the input video stream until the end of the video.
All the operations are performed inside a while loop that runs until the end of the input video stream.
frameCounter = 1
while True:
    r, im = vidCapture.read()
    if not r:
        print("Video Finished!")
        break
    print("Frame: " + str(frameCounter))
    frameCounter = frameCounter + 1
Find the position of the object in the image.
Sometimes a video contains several objects - I would say that usually it contains many. And since we process the video frame by frame, the machine learning model finds many objects in each frame. Our task is to figure out which of them is the right one. In my full solution I use fairly complicated logic for this, but for the purposes of this tutorial the easiest way is to pick the object closest to the position of the selected object in the previous frame. This approach works because consecutive frames usually don't change much.
    results = model.predict(source=im, conf=0.5, iou=0.1)  # ask the YOLO model to find objects; see the YOLO documentation for the parameters
    boxes = results[0].boxes  # boxes hold the coordinates of the objects YOLO has found
    box = closestBox(boxes, lastBoxCoords)  # returns the best box - the one closest to the last one
    lastBoxCoords = box.xyxy[0].numpy().astype(int)  # converts the PyTorch tensor into box coordinates and saves them for the next iteration
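If the scene contains objects of several classes, you can also narrow the candidates before choosing the closest box. A minimal sketch of such a filter (my addition; it would go between the predict call and closestBox, and class 0 is 'person' in the COCO dataset YOLOv8 was pretrained on):

    # keep only detections of the class we want to follow (0 = person in COCO)
    boxes = [b for b in boxes if int(b.cls) == 0]
    if not boxes:
        continue  # no suitable object in this frame - skip it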
Crop the image around the object
    newCoords = adjustBoxSize(box.xyxy[0].numpy().astype(int), box_width, box_height)  # YOLO returns the area of the object itself, so we convert it to a cropping area of the required size
    newCoords = adjustBoundaries(newCoords, [width, height])  # don't let the cropping area go beyond the edges of the video
    [box_left, box_top, box_right, box_bottom] = newCoords
    imCropped = im[box_top:box_bottom, box_left:box_right]  # crop the image
Add the cropped image to the output video stream
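One caveat before writing the frame: VideoWriter expects every frame to have exactly the size passed to its constructor, and frames of a different size may be silently dropped. A defensive resize (my addition, not part of the original script) guards against rounding glitches:

    # make sure the frame matches the writer's size exactly before writing it
    if imCropped.shape[1] != box_width or imCropped.shape[0] != box_height:
        imCropped = cv2.resize(imCropped, (box_width, box_height))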
    outputWriter.write(imCropped)  # write the cropped image as a new frame into the output video stream
This is the end of the code block inside the while loop. The program then reads the next frame from the input stream and processes it.
Close input and output video streams
After all frames are processed we have to close the video streams.
vidCapture.release()
outputWriter.release()
You can set the source and result file names with the fileSource and fileTarget variables, or you can use environment variables or other means to tell the program which files to process.
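For example, here is a minimal command-line interface (my addition; argparse is part of the Python standard library, and crop_video.py is a placeholder script name):

import argparse

parser = argparse.ArgumentParser(description="Crop a video around a tracked object")
parser.add_argument("source", help="path to the input video file")
parser.add_argument("target", help="path where the processed video will be saved")
args = parser.parse_args()
fileSource = args.source
fileTarget = args.target

You would then run the script as: python crop_video.py test_video.mp4 test_video_processed.mp4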
Additional support functions
We also need three support functions here: closestBox to find the best next object, adjustBoxSize to convert the size of the object to the size of the cropping area, and adjustBoundaries to keep the cropping area inside the video boundaries. I also use one additional function, boxCenter, which returns the horizontal and vertical coordinates of an area's center.
def boxCenter(coords):
    [left, top, right, bottom] = coords
    return [(left + right) / 2, (top + bottom) / 2]

def closestBox(boxes, coords):
    distance = []
    center = boxCenter(coords)
    for box in boxes:
        boxCent = boxCenter(box.xyxy[0].numpy().astype(int))
        distance.append(math.dist(boxCent, center))
    return boxes[distance.index(min(distance))]

def adjustBoxSize(coords, box_width, box_height):
    [centerX, centerY] = boxCenter(coords)
    return [centerX - box_width / 2, centerY - box_height / 2, centerX + box_width / 2, centerY + box_height / 2]

def adjustBoundaries(coords, screen):
    [left, top, right, bottom] = coords
    [width, height] = screen
    if left < 0:
        right = right - left
        left = 0
    if top < 0:
        bottom = bottom - top
        top = 0
    if right > width:
        left = left - (right - width)
        right = width
    if bottom > height:
        top = top - (bottom - height)
        bottom = height
    return [round(left), round(top), round(right), round(bottom)]
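A quick illustration of how these functions fit together, with arbitrary example numbers:

# suppose YOLO found an object at [350, 20, 410, 130] near the top of a 640x480 video
objBox = [350, 20, 410, 130]
crop = adjustBoxSize(objBox, 400, 400)     # -> [180.0, -125.0, 580.0, 275.0]: a 400x400 box centered on the object
crop = adjustBoundaries(crop, [640, 480])  # -> [180, 0, 580, 400]: shifted back inside the frame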
And that's all for now.
You could smooth the movements of the cropping box, remove the background, adjust the lighting, and perform many other operations on the video. But this article describes the basic principles of how you can use machine learning, computer vision, and useful Python libraries to process any videos you like.
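For instance, camera-shake-like jitter in the cropped video can be reduced by averaging the box coordinates over time. Here is a minimal sketch of exponential smoothing (my addition - one possible approach, not part of the script above):

boxSmoothed = None  # keeps the running average; initialize before the while loop

def smoothCoords(newCoords, alpha=0.2):
    # exponential moving average: smaller alpha = smoother, but slower to react
    global boxSmoothed
    if boxSmoothed is None:
        boxSmoothed = list(newCoords)
    else:
        boxSmoothed = [alpha * n + (1 - alpha) * s
                       for n, s in zip(newCoords, boxSmoothed)]
    return [round(c) for c in boxSmoothed]

Inside the loop you would call newCoords = smoothCoords(newCoords) right before cropping.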
Example
And here is an example of what you can get with this software:
Source video:
Processed file with the computer vision:
Cover photo by cottonbro studio from Pexels: https://www.pexels.com/photo/hand-of-a-person-and-a-bionic-hand-6153343/