(Cover image created with Dall-E mini and the caption "an AI wearing a funny top hat" - you know, because we're doing machine learning stuff today.)
It's been a while since my last post. I'm working on something rather large; you can expect some news soon!
But today, we'll have a look at you. Yes, you. Specifically, your beautiful faces. We'll make you wear hats. We'll use face-api.js and the Media Stream API for that.
Don't worry, though. Nothing will be processed in the cloud or anywhere outside your machine, you'll keep your images, and everything happens in your browser.
Let's get started!
Boilerplate
First, we'll need some HTML: a <video> element, a hat, two buttons for starting and stopping the video, and two <select> elements for selecting a hat and the device. You know, you might have two webcams.
<div class="container">
<div id="hat">
🎩
</div>
<!-- autoplay is important here, otherwise it doesn't immediately show the camera input. -->
<video id="video" width="1280" height="720" autoplay></video>
</div>
<div>
<label for="deviceSelector">
Select device
</label>
<select id="deviceSelector"></select>
</div>
<div>
<label for="hatSelector">
Select hat
</label>
<select id="hatSelector"></select>
</div>
<button id="start">
Start video
</button>
<button id="stop">
Stop video
</button>
Next, some CSS for the positioning of the hat:
#hat {
position: absolute;
display: none;
text-align: center;
}
#hat.visible {
display: block;
}
.container {
position: relative;
}
Awesome. Next, we install face-api.js with npm and create an index.js file for us to work in:
npm i face-api.js && touch index.js
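Depending on your setup, you'll also want to pull face-api.js into index.js. A minimal sketch, assuming a bundler (such as webpack or Vite) resolves packages from node_modules:
// At the top of index.js
import * as faceapi from 'face-api.js'
Alternatively, you could include the library's bundled script via a <script> tag, which exposes a global faceapi object.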
And lastly for the boilerplate, we select all elements we need from the HTML:
/**
* All of the necessary HTML elements
*/
const videoEl = document.querySelector('#video')
const startButtonEl = document.querySelector('#start')
const stopButtonEl = document.querySelector('#stop')
const deviceDropdownEl = document.querySelector('#deviceSelector')
const hatSelectorEl = document.querySelector('#hatSelector')
const hatEl = document.querySelector('#hat')
Awesome. Let's get to the fun part.
Accessing the webcam
To access a webcam, we will use the Media Stream API. This API allows us to access both video and audio devices, but we're only interested in the video ones. We'll also cache those devices in a global variable so we don't have to fetch them again. Let's have a look:
const listDevices = async () => {
if (devices.length > 0) {
return
}
devices = await navigator.mediaDevices.enumerateDevices()
// ...
}
The mediaDevices object lets us access all devices, both video and audio. Each device is an object of either the class InputDeviceInfo or MediaDeviceInfo. These objects both roughly look like this:
{
  deviceId: "someHash",
  groupId: "someOtherHash",
  kind: "videoinput", // or "audioinput"
  label: "Some human readable name (some identifier)"
}
The kind is what's interesting to us. We can use that to filter for all videoinput devices, giving us a list of available webcams. We will also add these devices to the <select> we've added in the boilerplate and mark the first device we encounter as the selected one:
/**
* List all available camera devices in the select
*/
let selectedDevice = null
let devices = []
const listDevices = async () => {
if (devices.length > 0) {
return
}
devices = (await navigator.mediaDevices.enumerateDevices())
.filter(d => d.kind === 'videoinput')
if (devices.length > 0) {
deviceDropdownEl.innerHTML = devices.map(d => `
<option value="${d.deviceId}">${d.label}</option>
`).join('')
// Select first device
selectedDevice = devices[0].deviceId
}
}
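One caveat not covered here: most browsers only reveal device labels after camera permission has been granted, so the dropdown may show empty names on a fresh page load. A minimal, optional workaround (a sketch, not part of the original code) is to request and immediately stop a throwaway stream before listing devices:
const warmUpPermissions = async () => {
  // Ask for any camera once so enumerateDevices() returns proper labels afterwards
  const stream = await navigator.mediaDevices.getUserMedia({ video: true })
  stream.getTracks().forEach(t => t.stop())
}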
Now, we'll actually show the webcam input to the user. For that, the Media Stream API offers the getUserMedia method. It receives a config object that defines what exactly we want to access and how. We don't need any audio, but we do need a video stream from the selectedDevice. We can also tell the API our preferred video size. Finally, we assign the output of this method to the <video> element, namely its srcObject:
const startVideo = async () => {
// Some more face detection stuff later
videoEl.srcObject = await navigator.mediaDevices.getUserMedia({
video: {
width: { ideal: 1280 },
height: { ideal: 720 },
deviceId: selectedDevice,
},
audio: false,
})
// More face detection stuff later
}
That should do the trick. Since the <video> has an autoplay attribute, it should immediately show what the cam sees. Unless we denied the browser access to the cam, of course. But why would we, right? After all, we want to wear hats.
If the hat-wearing is getting a bit too spooky, we'd also like to be able to stop the video. We can do that by first stopping each of the source object's tracks individually and then clearing the srcObject itself.
const stopVideo = () => {
// Some face detection stuff later on
if (videoEl.srcObject) {
videoEl.srcObject.getTracks().forEach(t => {
t.stop()
})
videoEl.srcObject = null
}
}
Now we can start and stop the video. Next up:
Doing the face recognition
Let's get the machine learning in. While setting up the boilerplate, we installed face-api.js, a pretty fantastic lib for all kinds of ML tasks around face recognition, detection and interpretation. It can also detect moods, tell us where different parts of the face are (such as the jawline or the eyes), and can work with different model weights. And the best part: it doesn't need any remote service; we only need to provide the correct model weights! Granted, these can be rather large, but we only need to load them once and can do face recognition for the rest of the session.
First, we need the models, though. The face-api.js repo has all the pre-trained models we need:
face_landmark_68_model-shard1
face_landmark_68_model-weights_manifest.json
ssd_mobilenetv1_model-shard1
ssd_mobilenetv1_model-shard2
ssd_mobilenetv1_model-weights_manifest.json
tiny_face_detector_model-shard1
tiny_face_detector_model-weights_manifest.json
We put those in a folder called models (served from the web root, so they're reachable at /models) and make face-api load them:
let faceApiInitialized = false
const initFaceApi = async () => {
if (!faceApiInitialized) {
await faceapi.loadFaceLandmarkModel('/models')
await faceapi.nets.tinyFaceDetector.loadFromUri('/models')
faceApiInitialized = true
}
}
The face detection box is what we need: it gives us x and y coordinates plus width and height values. We could use the individual facial landmarks for more precision, but for the sake of simplicity, we'll use the box instead.
With face-api.js, we can create an async function to detect a face in the stream of the video element. face-api.js does all the magic for us, and we only need to tell it in which element we want to look for faces and what model to use. We need to initialize the API first, though.
const detectFace = async () => {
await initFaceApi()
return await faceapi.detectSingleFace(videoEl, new faceapi.TinyFaceDetectorOptions())
}
This returns a detection object with an attribute called _box. This box contains all kinds of information: coordinates for every corner, the x and y coordinates of the top-left corner, and the width and height. To position the box that contains the hat, we need the top, left, width and height attributes. Since every hat emoji is slightly different, we cannot simply put them right over the face - they wouldn't fit.
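Roughly, the box looks something like this (illustrative values only; the field names come from face-api.js's Box class):
{
  x: 420,      // same as left
  y: 180,      // same as top
  top: 180,
  left: 420,
  right: 700,
  bottom: 460,
  width: 280,
  height: 280
}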
So, let's add the hats and some way to customize the hats' positioning:
/**
* All of the available hats
*/
const hats = {
tophat: {
hat: '🎩',
positioning: box => ({
top: box.top - (box.height * 1.1),
left: box.left,
fontSize: box.height,
}),
},
bowhat: {
hat: '👒',
positioning: box => ({
top: box.top - box.height,
left: box.left + box.width * 0.1,
width: box.width,
fontSize: box.height,
}),
},
cap: {
hat: '🧢',
positioning: box => ({
top: box.top - box.height * 0.8,
left: box.left - box.width * 0.10,
fontSize: box.height * 0.9,
}),
},
graduationcap: {
hat: '🎓',
positioning: box => ({
top: box.top - box.height,
left: box.left,
fontSize: box.height,
}),
},
rescuehelmet: {
hat: '⛑️',
positioning: box => ({
top: box.top - box.height * 0.75,
left: box.left,
fontSize: box.height * 0.9,
}),
},
}
The main reason for giving each hat its own positioning function is that every emoji sits a little differently within its glyph box, so each hat needs its own offsets and font size to line up with the face.
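To illustrate with made-up numbers: for a face box of 180 by 180 pixels with its top-left corner at left 400 and top 200, the top hat's positioning function would yield the following:
hats.tophat.positioning({ top: 200, left: 400, width: 180, height: 180 })
// => roughly { top: 2, left: 400, fontSize: 180 }
// 200 - 180 * 1.1 ≈ 2, so the hat lands just above the face box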
Since we haven't used the <select> for the hats just yet, let's add this next:
let selectedHat = 'tophat'
const listHats = () => {
hatSelectorEl.innerHTML = Object.keys(hats).map(hatKey => {
const hat = hats[hatKey]
return `<option value="${hatKey}">${hat.hat}</option>`
}).join('')
}
How to wear hats
Now we can start gluing things together. With the selectedHat variable and the box, we can position the selected hat on the detected face:
/**
* Positions the hat by a given box
*/
const positionHat = (box) => {
const hatConfig = hats[selectedHat]
const positioning = hatConfig.positioning(box)
hatEl.classList.add('visible')
hatEl.innerHTML = hatConfig.hat
hatEl.setAttribute('style', `
top: ${positioning.top}px;
left: ${positioning.left}px;
width: ${box.width}px;
height: ${box.height}px;
font-size: ${positioning.fontSize}px;
`)
}
As you can see, we're using CSS for that. Of course, we could paint the hat on a canvas and whatnot, but CSS makes things more straightforward and less laggy.
Now we need to integrate the face detection into the startVideo and stopVideo functions. I'll show the entire code of these functions here for completeness.
/**
* Start and stop the video
*/
let faceDetectionInterval = null
const startVideo = async () => {
listHats()
await listDevices()
stopVideo()
try {
videoEl.srcObject = await navigator.mediaDevices.getUserMedia({
video: {
width: { ideal: 1280 },
height: { ideal: 720 },
deviceId: selectedDevice,
},
audio: false
})
faceDetectionInterval = setInterval(async () => {
const positioning = await detectFace()
if (positioning) {
positionHat(positioning._box)
}
}, 60)
} catch(e) {
console.error(e)
}
}
const stopVideo = () => {
clearInterval(faceDetectionInterval)
hatEl.classList.remove('visible')
if (videoEl.srcObject) {
videoEl.srcObject.getTracks().forEach(t => {
t.stop()
})
videoEl.srcObject = null
}
}
As you can see, we're using an interval here to position everything. Due to the nature of face detection, it would be way too jiggly if we did it any more frequently. It's already quite jiggly, but around 60ms makes it at least bearable.
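If the jiggling bothers you, one optional idea (not part of the original code, just a sketch) is to smooth the box between detections before positioning the hat, for example with a simple exponential moving average:
let smoothedBox = null
const smoothBox = (box, factor = 0.5) => {
  // Blend the new box with the previous one to dampen jitter
  if (!smoothedBox) {
    smoothedBox = { top: box.top, left: box.left, width: box.width, height: box.height }
  } else {
    smoothedBox.top += (box.top - smoothedBox.top) * factor
    smoothedBox.left += (box.left - smoothedBox.left) * factor
    smoothedBox.width += (box.width - smoothedBox.width) * factor
    smoothedBox.height += (box.height - smoothedBox.height) * factor
  }
  return smoothedBox
}
// Inside the interval, you would then call: positionHat(smoothBox(positioning._box))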
Last, we add some event listeners, and we're good to go:
/**
* Event listeners
*/
startButtonEl.addEventListener('click', startVideo)
stopButtonEl.addEventListener('click', stopVideo)
deviceDropdownEl.addEventListener('change', e => {
selectedDevice = e.target.value
startVideo()
})
hatSelectorEl.addEventListener('change', e => {
selectedHat = e.target.value
})
The result
And here's the result:
Depending on your system, the hats may very well be off, because every system renders emojis differently. Also, give it a moment to actually load the model weights; they take a few seconds. For best results, view on a large screen and open the sandbox in a new tab. Obviously, the tab needs camera access.
If you'd like, how about sharing a screenshot of you wearing your favorite hat emoji in the comments?
I hope you enjoyed reading this article as much as I enjoyed writing it! If so, leave a ❤️ or a 🦄! I write tech articles in my free time and like to drink coffee every once in a while.
If you want to support my efforts, you can offer me a coffee ☕ or follow me on Twitter 🐦! You can also support me directly via Paypal!