Following the workflow of the Face Match system, this blog entry introduces the first core technique: face detection.
Face detection
How does the system identify which faces to detect?

Face detection is an essential step in face tagging, and the detection rate strongly correlates with the final face recall of the system. We take care to detect profile faces as well as frontal ones, because profiles are indispensable for recalling whole-profile face tracks, which are abundant in premium videos. Detecting rotated faces is also a necessity. See Figure 1 below for an illustration of the face poses we strive to detect, where yaw refers to out-of-plane (profile) rotation ranging from -90 to 90 degrees, and rotation refers to in-plane rotation over the same range. For efficiency, we do not cover the full range of in-plane rotation.


Figure 1. Out-of-plane and in-plane rotations of a human face

Incorporating such variation complicates the detector's architecture. We need to carefully design the algorithm and its parameters to balance accuracy, false detection rate, and running speed; keep in mind that the detector is the most time-consuming component of the whole system.

Building a multi-view face detector

Face detection is a well-studied problem with a long research tradition. State-of-the-art detectors follow the sliding-window approach, exhaustively scanning all possible sub-windows of an image, and rely on a cascade-boosting architecture to quickly filter out negative examples. See Figure 2 for an illustration of the cascaded classifiers. Each stage (denoted 1, 2, 3, etc.) is a classifier that scores the sub-windows; windows scoring below a threshold are discarded, and only those with higher scores pass on. With carefully designed classifiers, we can safely filter out a large portion of negative examples without falsely rejecting many true positives. Although the number of sub-windows in an image is huge, most are negative examples that run through only one or two stages, so the process is quite efficient for a single face pose.
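The early-rejection idea can be sketched in a few lines. This is a toy illustration, not the real detector: the stage classifiers and thresholds below are arbitrary placeholders, chosen only to show how negative windows exit the cascade after one or two cheap stages while face-like windows run the full pipeline.

```python
# Minimal cascade sketch: each stage scores a window and rejects it if the
# score falls below that stage's threshold. Real stages are boosted
# classifiers; these lambdas are illustrative stand-ins.

def cascade_detect(window, stages):
    """Run `window` through the cascade; return (is_face, stages_run)."""
    for i, (score_fn, threshold) in enumerate(stages, start=1):
        if score_fn(window) < threshold:
            return False, i          # rejected early: cheap for negatives
    return True, len(stages)         # survived every stage: report a face

# Toy stages: later stages are stricter.
stages = [
    (lambda w: sum(w) / len(w), 0.2),   # stage 1: very cheap filter
    (lambda w: min(w), 0.1),            # stage 2
    (lambda w: max(w) - min(w), 0.3),   # stage 3: more selective
]

face_like = [0.5, 0.6, 0.9]     # passes all stages
background = [0.1, 0.0, 0.1]    # rejected immediately

print(cascade_detect(face_like, stages))   # (True, 3)
print(cascade_detect(background, stages))  # (False, 1)
```

Because most sub-windows behave like `background` here, the average cost per window stays close to the cost of the first stage alone.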


Figure 2. Cascade classifiers

However, running a separate detector for each pose in parallel ignores the structure of the face pose space and is inefficient. To share features and detectors among poses, various hierarchical detector structures have been proposed and implemented. We chose the pyramid structure for its simple, independent training process for the underlying component detectors. The pyramid structure is a coarse-to-fine partition of multi-view faces; Figure 3 illustrates the partition process for the yaw angle.


Figure 3. Partition process of yaw angle

Our situation is a bit more complex, since we must handle in-plane rotation and yaw rotation at the same time. A branching node is therefore needed to decide whether a given example goes to the in-plane rotation branch or the yaw rotation branch (Figure 4). More specifically, we train a five-stage all-pose face/non-face detector as the root node, then train two ten-stage detectors for in-plane rotation and yaw rotation respectively. The outputs of these two detectors are compared to select the branch to explore further. After that, the problem reduces to the already-solved case of rotation in one dimension, whether in-plane or yaw, and the same coarse-to-fine strategy is used within the chosen branch. The final output includes both face position and pose estimation.
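The control flow of this branching structure can be sketched as follows. Everything here is a hedged mock-up: the scoring functions, thresholds, and pose bins are invented for illustration (the real root is a five-stage cascade, not a single score), but the root-filter, branch-comparison, and coarse-to-fine refinement steps mirror the description above.

```python
# Sketch of the branching detector: a root face/non-face filter, then two
# branch detectors (in-plane vs. yaw) whose scores are compared to pick
# the branch, which then refines down to a pose bin.

class Branch:
    """One rotation branch: a branch-level score plus per-pose detectors."""
    def __init__(self, name, branch_score, pose_scores):
        self.name = name
        self.branch_score = branch_score   # scores the whole branch
        self.pose_scores = pose_scores     # {pose_angle: score_fn}

    def refine(self, window):
        # Coarse-to-fine: the finest partition picks the best pose bin.
        best = max(self.pose_scores, key=lambda p: self.pose_scores[p](window))
        return self.name, best

def detect_pose(window, root_score, branches):
    if root_score(window) < 0.5:           # root all-pose face/non-face filter
        return None                        # not a face at all
    chosen = max(branches, key=lambda b: b.branch_score(window))
    return chosen.refine(window)

# Toy example: a "window" is just a dict of precomputed responses.
inplane = Branch("in-plane", lambda w: w["inplane"],
                 {-45: lambda w: w["inplane"] * 0.5,
                  0: lambda w: w["frontal"],
                  45: lambda w: 0.1})
yaw = Branch("yaw", lambda w: w["yaw"],
             {-90: lambda w: 0.2,
              0: lambda w: w["frontal"],
              90: lambda w: w["profile"]})

w = {"inplane": 0.3, "yaw": 0.8, "frontal": 0.4, "profile": 0.9}
print(detect_pose(w, lambda w: w["frontal"] + w["profile"], [inplane, yaw]))
# → ('yaw', 90)
```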


Figure 4. Whole face detector structure to handle multi-view faces

Haar wavelet features are typically used in face detection because they are simple and fast to compute, but the feature pool often spans tens of thousands of dimensions. In contrast, the local binary pattern (LBP) feature is only a 58-bin sparse histogram. It captures the local geometric structure of the image and is less sensitive to global illumination variations, so we've adopted the LBP histogram for our system.
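The 58 bins come from the "uniform" LBP patterns: of the 256 possible 8-neighbor codes, exactly 58 have at most two 0/1 transitions around the circle. A minimal sketch of the histogram computation, assuming the standard uniform-LBP formulation (non-uniform patterns are simply dropped here, though implementations often pool them into an extra bin):

```python
# Sketch of a 58-bin uniform LBP histogram over a grayscale image
# (a plain list-of-lists of pixel values).

def is_uniform(pattern):
    """A pattern is uniform if its circular bit string has <= 2 transitions."""
    bits = [(pattern >> i) & 1 for i in range(8)]
    return sum(bits[i] != bits[(i + 1) % 8] for i in range(8)) <= 2

UNIFORM = [p for p in range(256) if is_uniform(p)]   # exactly 58 patterns
BIN = {p: i for i, p in enumerate(UNIFORM)}          # pattern -> histogram bin

def lbp_code(img, y, x):
    """8-neighbor LBP code: threshold each neighbor against the center pixel."""
    c = img[y][x]
    neighbors = [img[y-1][x-1], img[y-1][x], img[y-1][x+1], img[y][x+1],
                 img[y+1][x+1], img[y+1][x], img[y+1][x-1], img[y][x-1]]
    return sum(1 << i for i, n in enumerate(neighbors) if n >= c)

def lbp_histogram(img):
    hist = [0] * len(UNIFORM)
    for y in range(1, len(img) - 1):
        for x in range(1, len(img[0]) - 1):
            code = lbp_code(img, y, x)
            if code in BIN:                  # ignore non-uniform patterns
                hist[BIN[code]] += 1
    return hist

print(len(UNIFORM))   # 58
```

Because the codes depend only on sign comparisons against the center pixel, a global brightness shift leaves the histogram unchanged, which is the illumination robustness mentioned above.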

We've also integrated the boosting framework for training the classifier stages. We use a RankBoost-like reweighting scheme in each round to balance the weights of positive and negative examples, which tunes the classifiers to focus more on the limited positive examples. We also follow the nested cascade structure to further reduce the number of weak classifiers needed in the detector.
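One simple way to realize such a balancing scheme is to renormalize positives and negatives separately after each exponential weight update, so each class carries equal total mass regardless of how scarce the positives are. The exact update used in the system is not spelled out above, so the following is only an illustrative sketch of that idea:

```python
# Hedged sketch of per-round reweighting: exponential update from the
# current round's scores, then per-class normalization so positives and
# negatives each carry total mass 0.5.

import math

def reweight(weights, labels, scores, lr=1.0):
    """labels are +1/-1; scores are the current weak classifier's outputs."""
    w = [wi * math.exp(-lr * yi * si)
         for wi, yi, si in zip(weights, labels, scores)]
    pos = sum(wi for wi, yi in zip(w, labels) if yi > 0)
    neg = sum(wi for wi, yi in zip(w, labels) if yi < 0)
    return [wi * (0.5 / pos if yi > 0 else 0.5 / neg)
            for wi, yi in zip(w, labels)]

labels = [+1, +1, -1, -1, -1, -1]          # few positives, many negatives
weights = [1 / 6] * 6
scores = [0.8, -0.2, 0.5, 0.9, 0.1, 0.4]   # misclassified examples gain weight
weights = reweight(weights, labels, scores)
print(round(sum(w for w, y in zip(weights, labels) if y > 0), 6))  # 0.5
```

After the update, the two positive examples together weigh as much as the four negatives, so the next weak classifier cannot improve its weighted error by ignoring the positive class.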

Synthetic examples, such as flipped and rotated versions of faces with small random perturbations in position, scale, and rotation, are created to enlarge the face dataset. Multithreading speeds up the training process.
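A minimal sketch of this augmentation step, covering only flips and small random shifts (rotation and scale jitter would be added the same way); the images, parameters, and helper names are illustrative, not the system's actual pipeline:

```python
# Sketch of face-data augmentation: each training face spawns synthetic
# variants via horizontal flips and small random translations.

import random

def hflip(img):
    """Mirror a list-of-lists image left-to-right."""
    return [row[::-1] for row in img]

def shift(img, dy, dx, fill=0):
    """Translate by (dy, dx), padding uncovered pixels with `fill`."""
    h, w = len(img), len(img[0])
    return [[img[y - dy][x - dx] if 0 <= y - dy < h and 0 <= x - dx < w else fill
             for x in range(w)] for y in range(h)]

def augment(img, n, max_shift=2, rng=random):
    out = [img, hflip(img)]                 # always keep the mirrored face
    for _ in range(n):
        dy = rng.randint(-max_shift, max_shift)
        dx = rng.randint(-max_shift, max_shift)
        base = hflip(img) if rng.random() < 0.5 else img
        out.append(shift(base, dy, dx))     # randomly jittered variant
    return out

face = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
variants = augment(face, n=4)
print(len(variants))   # 6: original + flip + 4 random jitters
```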

Our multi-view face detector processes a 640x360 frame in about 300 ms. Accuracy is about 80 percent for frontal faces and 60 percent for profile faces, both at a 5 percent false detection rate.

This is the 2nd blog in the Face Match tech blog series. You can browse the other three blogs in this series by visiting:

1. Face Match System – Overview

3. Face Match System – Shot Boundary Detection and Face Tracking

4. Face Match System – Clustering, Recognition and Summary