Matt Hamilton

Using a Convolutional Neural Network (CNN) to Detect Smiling Faces

What is a Convolutional Neural Network (CNN)? How can one be used to detect features in images? This is the video of a live coding session in which I show how to build a CNN in Python using Keras and extend the "smile detector" I built last week to use it.

A 1080p version of this video can be found on Cinnamon

A Convolutional Neural Network is a particular type of neural network that is well suited to analysing images. It works by passing a 'kernel' across the input image (convolution) to produce an output. These convolutional layers are stacked to produce a deep network that is able to learn quite complex features in images.

A typical Convolutional Neural Network

In this session I coded a simple 3-layer CNN and trained it with manually classified images of faces.
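
The exact network from the stream isn't reproduced here, but a minimal sketch of a 3-layer CNN in Keras -- assuming 256x256 RGB face crops (matching the input shape used later) and a single sigmoid output for 'smiling' -- might look something like this:

from tensorflow.keras import layers, models

# Sketch of a simple 3-layer CNN: three convolution/pooling stages
# followed by a small dense classifier with a single sigmoid output
# giving a 0..1 'smiliness' score.
model = models.Sequential([
    layers.Conv2D(16, (3, 3), activation='relu', input_shape=(256, 256, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(1, activation='sigmoid'),
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])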

Much of the code was based on the previous iteration of this project. After the live coding session, I refactored the code to use Python generators to simplify the processing pipeline.
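
As a quick illustration (a toy example, not code from the project), chained generators let each stage pull items lazily from the stage before it, so only one frame ever needs to be held in memory:

def numbers():
    # Source generator: yields values one at a time
    for n in range(10):
        yield n

def evens(seq):
    # Filter generator: lazily consumes whatever iterable it is given
    for n in seq:
        if n % 2 == 0:
            yield n

print(list(evens(numbers())))  # [0, 2, 4, 6, 8]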

Frame Generator

This method opens the video file and iterates through it, yielding each frame in turn.

def frame_generator(self, video_fn):
    cap = cv2.VideoCapture(video_fn)

    while True:
        # Read each frame of the video
        ret, frame = cap.read()

        # End of file, so break loop
        if not ret:
            break

        yield frame

    cap.release()

Calculating the Threshold

As in the previous session, we iterate through the frames and calculate the difference between each frame and the previous one. The method then returns the threshold needed to keep only the top 5% most-changed frames (by default):

def calc_threshold(self, frames, q=0.95):                                                                                                                                         
    prev_frame = next(frames)                                                                                                                                                     
    counts = []                                                                                                                                                                   
    for frame in frames:                                                                                                                                                          
        # Calculate the pixel difference between the current                                                                                                                      
        # frame and the previous one                                                                                                                                              
        diff = cv2.absdiff(frame, prev_frame)                                                                                                                                     
        non_zero_count = np.count_nonzero(diff)                                                                                                                                   

        # Append the count to our list of counts                                                                                                                                  
        counts.append(non_zero_count)                                                                                                                                             
        prev_frame = frame                                                                                                                                                        

    return int(np.quantile(counts, q))

Filtering the Image Stream

This is another generator: it takes an iterable of frames and a threshold, and yields each frame whose difference from the previous frame is above the supplied threshold.

def filter_frames(self, frames, threshold):                                                                                                                                       
    prev_frame = next(frames)                                                                                                                                                     
    for frame in frames:                                                                                                                                                          
        # Calculate the pixel difference between the current                                                                                                                      
        # frame and the previous one                                                                                                                                              
        diff = cv2.absdiff(frame, prev_frame)                                                                                                                                     
        non_zero_count = np.count_nonzero(diff)                                                                                                                                   

        if non_zero_count > threshold:                                                                                                                                            
            yield frame                                                                                                                                                           

        prev_frame = frame

Finding the Smiliest Image

By factoring out the methods above, we can chain the generators together and pass them into this method to actually look for the smiliest image. This means that (unlike the previous version) this method doesn't need to concern itself with deciding which frames to analyse.

We use the trained neural network (as a TensorFlow Lite model) to predict whether a face is smiling. Much of this structure is similar to the last session, in which we first scan the image to find faces. We then align each of those faces using a facial aligner -- this transforms the face so that the eyes are in the same location in each image. We pass each face into the neural network, which gives us a score from 0 to 1.0 for how likely it is that the face is smiling. We sum all those values to get an overall 'smiliness' score for the frame.
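
The method below assumes that self.detector, self.face_aligner and self.interpreter have already been created on the class. As a rough sketch (the actual constructor in the package may differ), they could be set up with dlib, imutils and TensorFlow Lite along these lines:

import dlib
import tensorflow as tf
from imutils.face_utils import FaceAligner

# Sketch only; the real Smiler constructor may differ.
# landmarks_path points at a dlib facial landmark model file,
# model_path at the TensorFlow Lite smile model.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor(landmarks_path)
face_aligner = FaceAligner(predictor, desiredFaceWidth=256,
                           desiredFaceHeight=256)
interpreter = tf.lite.Interpreter(model_path=model_path)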

def find_smiliest_frame(self, frames, callback=None):

    # Allocate the tensors for Tensorflow lite                                                                                                                                    
    self.interpreter.allocate_tensors()
    input_details = self.interpreter.get_input_details()
    output_details = self.interpreter.get_output_details()

    def detect(gray, frame):
        # detect faces within the greyscale version of the frame                                                                                                                  
        faces = self.detector(gray, 2)
        smile_score = 0

        # For each face we find...                                                                                                                                                
        for rect in faces:
            (x, y, w, h) = rect_to_bb(rect)
            face_orig = imutils.resize(frame[y:y + h, x:x + w], width=256)
            # Align the face                                                                                                                                                      
            face_aligned = self.face_aligner.align(frame, gray, rect)
            # Reshape to add a batch dimension (the aligned face is 256x256x3)
            face_aligned = face_aligned.reshape(1, 256, 256, 3)
            # Scale to pixel values to 0..1                                                                                                                                       
            face_aligned = face_aligned.astype(np.float32) / 255.0
            # Pass the face into the input tensor for the network                                                                                                                 
            self.interpreter.set_tensor(input_details[0]['index'],
                                        face_aligned)
            # Actually run the neural network                                                                                                                                     
            self.interpreter.invoke()
            # Extract the prediction from the output tensor                                                                                                                       
            pred = self.interpreter.get_tensor(
                output_details[0]['index'])[0][0]

            # Keep a sum of all 'smiliness' scores                                                                                                                                
            smile_score += pred

        return smile_score, frame

    best_smile_score = 0
    best_frame = next(frames)

    for frame in frames:
        # Convert the frame to grayscale                                                                                                                                          
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

        # Call the detector function                                                                                                                                              
        smile_score, frame = detect(gray, frame)

        # Check if we have more smiles in this frame                                                                                                                              
        # than out "best" frame                                                                                                                                                   
        if smile_score > best_smile_score:
            best_smile_score = smile_score
            best_frame = frame
            if callback is not None:
                callback(best_frame, best_smile_score)

    return best_smile_score, best_frame
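
The model_path passed to Smiler points at a TensorFlow Lite version of the trained network. If you train your own Keras model, the conversion is straightforward; a minimal sketch (the output filename here is just illustrative) is:

import tensorflow as tf

# Convert a trained Keras model (the `model` object from training)
# to TensorFlow Lite and write the flatbuffer to disk.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

with open('smile_model.tflite', 'wb') as f:
    f.write(tflite_model)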

We can then chain the functions together. Note that the pre-scan in calc_threshold exhausts the first frame generator, so we create a second one for the filtering pass:

smiler = Smiler(landmarks_path, model_path)
fg = smiler.frame_generator(args.video_fn)
threshold = smiler.calc_threshold(fg, args.quantile)
fg = smiler.frame_generator(args.video_fn)
ffg = smiler.filter_frames(fg, threshold)
smile_score, image = smiler.find_smiliest_frame(ffg)

Output

Testing it out it all works pretty well, and finds a nice snapshot from the video of smiling faces.

A frame of smiling people

The full code to this is now wrapped up as a complete Python package:

GitHub: Choirless / smiler

Extract the most smiling image from a video clip

Smiler

This is a library and CLI tool to extract the "smiliest" frame from a video of people.

It was developed as part of Choirless for the IBM Call for Code.

Installation

% pip install choirless_smiler

Usage

Simple usage:

% smiler video.mp4 snapshot.jpg

Output image of people singing

It will do a pre-scan to determine the threshold of change needed to select the 5% of frames that differ most from their previous frame, and only those frames are considered. If you already know the threshold of change you want to use, you can supply it directly, e.g.

% smiler video.mp4 snapshot.jpg --threshold 480000

The first time smiler runs it will download facial landmark data and store it in ~/.smiler. The location of this data and the cache directory can be specified as arguments.

Help

% smiler -h
usage: smiler [-h] [--verbose] [--threshold THRESHOLD]
              [--landmarks-url LANDMARKS_URL] [--cache-dir CACHE_DIR]
              [--quantile QUANTILE]
              video_fn image_fn
Save thumbnail of smiliest frame in video

positional arguments:
  video_fn              filename for video to

I hope you enjoyed the video. If you want to catch these sessions live, I stream each week at 2pm UK time on the IBM Developer Twitch channel:

https://developer.ibm.com/livestream

Top comments (1)

Lizard

Very nice project :)