Following the workflow of the Face Match system, this blog entry introduces the third core technique: face track clustering and recognition.
Track tagging

When a series of face tracks has been extracted from a set of videos, the next step is to tag them automatically with probable actor names from the given show. After all, manually processing all the tracks from scratch would be infeasible. The tags, at some acceptable accuracy rate (say, 80 percent), provide valuable cues for a person to verify the tracks in groups. When presented in a user-friendly interface, the tags also reduce the time required to correct erroneous matches. Given that, we are seeking ways to improve tagging accuracy for face tracks. This naturally falls into the machine-learning framework, which is widely adopted in the computer vision research community. In this blog entry, we refer to this problem of automatically annotating faces (not tracks) as face tagging.

Traditionally, face verification technology tries to identify whether a given image belongs to a specific person in a set of candidates. Though successfully applied in controlled environments, the approach rests on strict assumptions: the video must have good lighting, the actors must face the camera, and their faces cannot be obscured. These assumptions do not hold in challenging environments such as TV shows or movies. General research interest has recently turned toward uncontrolled datasets such as "Labeled Faces in the Wild" (LFW), and the benchmark of identifying whether two faces belong to the same person has attracted a lot of attention. However, the LFW database contains many people with only two face samples each, so the benchmark hardly covers the case of identifying many people, in many poses, in a truly unconstrained environment.

In the machine-learning framework, the problem of track tagging essentially boils down to constructing a proper track similarity function from the similarity of the faces in the tracks. Because we are facing the largest dataset for face verification in research history, the time and labor for human verification have become the most critical metrics, and only by improving the accuracy of track tagging can we significantly reduce both. A few aspects impact the results: 1) the feature set; 2) the learning approach; 3) the cold start problem. Because of the very large dataset, we are also constrained by the amount of processing time we can afford: given the potential number of all videos available to Hulu, we need to keep the processing time under one second per face image. Thus, we cannot afford recent effective yet computationally heavy methods, such as those based on dense local features. Next, we will elaborate on each of these aspects.

Feature extraction

In the current system, we leverage several kinds of visual information to improve the tagging accuracy. Compared with a single image, we are equipped with the temporal information provided by continuous face tracks. Fusing these tracks into a 3-D face model is an interesting alternative for us to explore in the future. For now, we limit ourselves to selecting a few representative faces and constructing the track similarity function as a function of the similarity of those representative faces.

First, we resize the image to ensure the face region is 80x80 pixels. Then we enlarge the selected region to 80x160 pixels by extending it 40 pixels up and 40 pixels down. See Figure 1 for an example.

Standard face features such as global face features and LBP (local binary pattern) facial features are extracted on the global region and local regions, respectively. The global face feature is extracted on the aligned face with a 16x16 grid layout, with each grid cell contributing a 58-dim LBP histogram. The LBP facial features are extracted on each local facial window with a 4x4 grid layout, where a 58-dim histogram over the LBP codes is accumulated for each cell; the local histograms of a window are concatenated into a 928-dim vector.
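A minimal sketch of this grid-based LBP extraction, using scikit-image, looks like the following; the 59-bin uniform-pattern layout (58 uniform codes plus one catch-all bin) and the per-cell normalization are illustrative assumptions about the exact binning:

```python
import numpy as np
from skimage.feature import local_binary_pattern

def grid_lbp_histograms(gray, grid_rows, grid_cols, n_bins=59):
    """Concatenate per-cell LBP histograms over a grid_rows x grid_cols layout."""
    # Non-rotation-invariant uniform LBP with 8 neighbors: codes 0..58.
    lbp = local_binary_pattern(gray, P=8, R=1, method="nri_uniform")
    h, w = lbp.shape
    feats = []
    for r in range(grid_rows):
        for c in range(grid_cols):
            cell = lbp[r * h // grid_rows:(r + 1) * h // grid_rows,
                       c * w // grid_cols:(c + 1) * w // grid_cols]
            hist, _ = np.histogram(cell, bins=n_bins, range=(0, n_bins))
            feats.append(hist / max(hist.sum(), 1))  # L1-normalize each cell
    return np.concatenate(feats)

# Global face feature: 16x16 grid over the aligned 80x80 face region.
# global_feat = grid_lbp_histograms(face80, 16, 16)
# Local facial feature: 4x4 grid over one facial window.
# local_feat = grid_lbp_histograms(eye_window, 4, 4)
```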

A few face verification approaches require face alignment and face warping as a preprocessing step. The alignment process identifies landmark points on the face, e.g., the corners of the eyes, the mouth, and the nose. The face can then be warped to a frontal pose by triangulating the facial landmarks and finding the affine mapping, so that global face features can be extracted on the warped faces as well. However, in our experiments we did not see much improvement from this step, which may be due to the fragility of the alignment algorithm we used.
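For illustration, a minimal alignment step can be sketched with OpenCV as below, assuming three landmarks (two eyes and the mouth center); the canonical template coordinates are made-up values, not those of our system:

```python
import cv2
import numpy as np

# Canonical positions of (left eye, right eye, mouth center) in an 80x80 crop.
TEMPLATE = np.float32([[24, 28], [56, 28], [40, 60]])

def align_face(image, landmarks):
    """Warp a face so its detected landmarks land on the canonical template."""
    src = np.float32(landmarks)               # three detected (x, y) points
    M = cv2.getAffineTransform(src, TEMPLATE)
    return cv2.warpAffine(image, M, (80, 80))
```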

We assume that the given character's appearance will not change often in one video. So we further incorporate a few other features to reflect the character's appearance, including hair and face, as well as the environment in which he or she appears. More specifically, we extract texture and color features in respective areas of the face image to reflect hair and scenery. The LBP feature is also extracted on the full 80x160 region to represent the face as a whole. The importance weights among different modalities are learned afterward with some label information for face tracks.
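As an illustration, such context features can be extracted along the following lines; the exact regions used for hair and clothing are assumptions based on the 80x160 crop described above:

```python
import numpy as np

def color_histogram(region_bgr, bins=8):
    """Joint 8x8x8 color histogram over a region, L1-normalized."""
    hist, _ = np.histogramdd(region_bgr.reshape(-1, 3),
                             bins=(bins, bins, bins), range=((0, 256),) * 3)
    hist = hist.ravel()
    return hist / max(hist.sum(), 1)

# crop: 160x80x3 BGR image (the 80x160 region, rows x cols).
# hair_feat    = color_histogram(crop[0:40, :])     # band above the face
# clothes_feat = color_histogram(crop[120:160, :])  # band below the face
# plus an LBP histogram on the full region (see grid_lbp_histograms above).
```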

Figure 1. Feature extraction for face tracks

Learning approach

The primary goal in this step is to construct a proper track similarity function as a function of the similarity of the underlying faces across the tracks.

Given a new video, the tracks of an actor usually will be more similar to tracks of the same actor in this specific video than to tracks of the same actor from other videos, because the appearance of the actor remains mostly the same within a given video. Thus label information from the current video is more valuable than that from other videos. With these labels, we can expect higher tagging accuracy, so we adopt an online learning scheme to incorporate newly verified track labels from the current video as early as possible.

As we need to handle several tens of thousands of actors in our system, building and maintaining a supervised model for every possible actor is infeasible, even though we only need to deal with 100 to 500 actors for a given show. Given the online setting and the huge number of candidates, we adopt a k-Nearest Neighbor (kNN) based lazy-learning approach: the faces are annotated independently, and the face tags then vote to determine the tag for the track. The merit of such lazy learning is that we do not need to maintain any learned model, and newly acquired labels can be added instantly. As shown in Figure 2, after feature extraction, an approximate kNN scheme is used to speed up the neighbor-finding process. For a face track $X$, the $j$th feature of the $i$th face in $X$ is denoted $X_{ij}$, and its nearest samples are denoted $S_1, S_2, \ldots$, where $S_1$ is the most similar neighbor, $S_2$ the second, and so on. Each face is represented by a linear combination of its nearest neighbors, with the weight of each neighbor given by its similarity to the target face. The L2-norm is used in the similarity metric because the L1-norm results in worse performance and is far less efficient:

$$b_{ij} = \arg\min_{b} \Big\| X_{ij} - \sum_{k} b_{k} S_{k} \Big\|_2^2 + \lambda \, \| b \|_2^2$$

We treat faces with different poses in the same way, since the database is large enough that faces will find neighbor faces with the same pose. With the coefficients $b_{ij}$, we can generate a voting distribution over the identity list $a_{ij}$:

$$p_{ij}(a) = \frac{\sum_{k:\,\mathrm{id}(S_k) = a} b_{ijk}}{\sum_{k} b_{ijk}}$$

To measure the reliability of the voting, we use the sparse concentration index (SCI) as the confidence score:

$$c_{ij} = \mathrm{SCI}(b_{ij}) = \frac{K \cdot \max_{a} p_{ij}(a) - 1}{K - 1},$$

where $K$ is the number of candidate identities.
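Putting the pieces above together, the per-face voting can be sketched as follows; the library choice (scikit-learn's NearestNeighbors) and the parameters k and λ are illustrative assumptions:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def face_votes(x, db_feats, db_labels, n_classes, k=20, lam=0.1):
    """Vote distribution p and SCI confidence c for one face feature x."""
    # In practice the index is built once per database, not per query.
    nn = NearestNeighbors(n_neighbors=k).fit(db_feats)
    idx = nn.kneighbors(x[None, :], return_distance=False)[0]
    S = db_feats[idx]                                # k x d neighbor matrix
    # L2-regularized reconstruction: argmin_b ||x - S^T b||^2 + lam * ||b||^2
    b = np.linalg.solve(S @ S.T + lam * np.eye(k), S @ x)
    b = np.maximum(b, 0)                             # keep non-negative votes
    votes = np.zeros(n_classes)
    for coeff, label in zip(b, db_labels[idx]):
        votes[label] += coeff                        # pool weights by identity
    p = votes / max(votes.sum(), 1e-12)
    # Sparse concentration index: 1 for a single-identity vote, 0 for uniform.
    c = (n_classes * p.max() - 1) / (n_classes - 1)
    return p, c
```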

To fuse the confidence scores $c_{ij}$ with the votes of the samples $X_{ij}$, we weight each face's vote distribution by a weighting function of its confidence, $g(c_{ij}) = v_{jk}\, c_{ij}^2$, where the $c^2$ part magnifies votes with large confidence scores and the $v_{jk}$ are parameters to be learned (see the next subsection). This means that when the confidence score is not high, the vote carries a lower weight.

Learning voting weights for features with structured output SVM

The standard structured output SVM primal formulation is given as follows:

$$\min_{w,\,\xi} \ \frac{1}{2}\|w\|^2 + C \sum_{n=1}^{N} \xi_n \qquad \text{s.t.} \quad w^\top \Psi(X_n, y_n) - w^\top \Psi(X_n, y) \ \ge\ \Delta(y_n, y) - \xi_n, \quad \xi_n \ge 0, \ \ \forall\, y \ne y_n$$

The voting weight $w$ is the stack of the vectors $v_j$. To learn $w$, we define the joint feature map $\Psi(X, y) = \Phi(X)\,\delta_y$, where $\delta_y$ is a vector whose $y$-th row is 1 and all other rows are 0, so that it selects the features for class $y$, and $\Phi$ maps a track to a matrix of confidences for the different identities:

$$[\Phi(X)]_{j,a} = \sum_{i} c_{ij}^2 \, p_{ij}(a)$$

Learning a structured output SVM with the joint feature map $\Psi$ defined above results in weight vectors that best combine the multi-view features for face track recognition. To vote the identity label for a track $X$, we use:

$$\hat{y}(X) = \arg\max_{y}\ w^\top \Psi(X, y)$$
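In code, this prediction rule reduces to a weighted column sum of the confidence matrix $\Phi(X)$. The sketch below simplifies to one learned weight per feature channel, which is an assumption about the exact parameterization:

```python
import numpy as np

def predict_identity(phi, v):
    """argmax_y w^T Psi(X, y) with Psi(X, y) = Phi(X) * delta_y."""
    # Selecting column y of phi and dotting with v gives sum_j v[j] * phi[j, y],
    # so all class scores at once are just the vector-matrix product v @ phi.
    scores = v @ phi
    return int(np.argmax(scores))

# Toy example: 3 feature channels (face, hair, clothes), 4 candidate actors.
phi = np.array([[0.6, 0.2, 0.1, 0.1],
                [0.3, 0.4, 0.2, 0.1],
                [0.5, 0.1, 0.3, 0.1]])
v = np.array([1.5, 0.5, 0.8])       # illustrative learned channel weights
print(predict_identity(phi, v))     # -> 0
```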

Fusing samples for the track label

One simple way to fuse the different samples $X_i$ is to use the identity distributions $p_{ij}(\cdot)$ of all samples when computing $\Psi$. However, this produces mismatches: many samples are nearly identical, and they can all match faces with the wrong identity. To avoid such correlated mistakes, we adopt the diversity-sampling algorithm GRASSHOPPER to select diverse samples. We define the similarity function for GRASSHOPPER as:

$$\mathrm{sim}(X_i, X_{i'}) = \exp\!\left( - \frac{\| X_i - X_{i'} \|_2^2}{\| X_i - S_1(X_i) \|_2 \, \| X_{i'} - S_1(X_{i'}) \|_2} \right)$$

where $S_1(X_i)$ and $S_1(X_{i'})$ are the most similar neighbors of $X_i$ and $X_{i'}$, respectively.
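For reference, GRASSHOPPER-style selection can be sketched compactly with absorbing random walks (following Zhu et al., 2007); the teleport weight here is an illustrative parameter:

```python
import numpy as np

def grasshopper(sim, n_pick, damping=0.9):
    """Pick n_pick diverse, central items from a pairwise similarity matrix."""
    n = sim.shape[0]
    # Row-stochastic transition matrix with uniform teleport for ergodicity.
    P = damping * sim / sim.sum(axis=1, keepdims=True) + (1 - damping) / n
    # First pick: the state with the largest stationary probability.
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmax(np.real(vals))])
    pi /= pi.sum()
    picked = [int(np.argmax(pi))]
    remaining = [i for i in range(n) if i != picked[0]]
    while len(picked) < n_pick and remaining:
        # Turn picked items into absorbing states; rank the rest by their
        # expected visit counts N^T 1 with N = (I - Q)^(-1).
        Q = P[np.ix_(remaining, remaining)]
        visits = np.linalg.solve(np.eye(len(remaining)) - Q.T,
                                 np.ones(len(remaining)))
        nxt = remaining[int(np.argmax(visits))]
        picked.append(nxt)
        remaining.remove(nxt)
    return picked
```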

Finally, the label of the face track $X$ is computed over the diverse subset $\mathcal{D}$ selected by GRASSHOPPER:

$$\hat{y}(X) = \arg\max_{y}\ w^\top \Psi(\{X_i\}_{i \in \mathcal{D}},\, y)$$

Experiments show that, with sufficiently large face databases, the precision of automatic track tagging is as high as 95 percent when annotating 80 percent of the face tracks. For some high-quality episodes, the system is able to annotate 90 percent of the face tracks with 95 percent accuracy. This significantly reduces the time required for manual confirmation.

After automatic tagging, the face tracks are clustered with respect to visual similarity and presented to human annotators for verification. The corrected labels are fed back into the system to further improve the tagging accuracy.

Cold start

The cold start phenomenon is frequently discussed in research on recommendation systems: due to the lack of information about a newcomer to the system, no cue is available for deciding which items to recommend. Similarly, when a new show or a new actor comes into our system, we have no labeled information, and thus supervised learning is not feasible. In such a situation, we resort to unsupervised/semi-supervised learning approaches to provide the initial labels for a few tracks to the system.

Simple unsupervised hierarchical clustering is possible, but we can do better than that. Though we do not have label information for a new show or a new actor, we do have labels for other actors in other shows. Thus, with a few pre-built classifiers for each of the known actors, we construct a similarity vector that measures the similarities of the current track to the given set of known actors. See Figure 2 for details, where the small graph illustrates an example of one track's classification scores against a list of known actors. Arguably, this similarity vector encodes some prior knowledge in the system, so we expect this semi-supervised learning scheme to outperform the unsupervised scheme. Experimental results show that the semi-supervised scheme improves the purity score of the clusters by 30 percent over the unsupervised scheme.
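A toy sketch of this semi-supervised scheme is shown below; the synthetic data, the linear SVM as the pre-built per-actor classifier, and the clustering choice are all illustrative assumptions:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)

# Stand-ins for labeled faces of 5 known actors from other shows (40 each).
centers = rng.normal(size=(5, 32))
known_X = np.vstack([c + 0.3 * rng.normal(size=(40, 32)) for c in centers])
known_y = np.repeat(np.arange(5), 40)
clf = LinearSVC().fit(known_X, known_y)        # one-vs-rest, built offline

# New-show tracks (unlabeled): one pooled feature vector per track.
tracks = rng.normal(size=(12, 32))
sim_vectors = clf.decision_function(tracks)    # 12 tracks x 5 known actors
# Cluster the tracks in the similarity-vector space instead of raw features.
clusters = AgglomerativeClustering(n_clusters=4).fit_predict(sim_vectors)
print(clusters)
```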

Figure 2. Computing track similarities (with respect to known actors) for face track clustering

Lessons learned

  • Combining face features and context features for hair and clothes improves annotation accuracy.
  • The online active learning scheme shows better results than offline ones.
  • Confirmation is an easier and faster task than annotation for humans. More accurate prediction results help a lot in reducing confirmation time.
  • Grouping visually similar tracks together for confirmation lightens manual workload and significantly reduces human reaction time.
  • The semi-supervised scheme helps solve the cold start problem, and therefore helps annotation.

Our exploration is a preliminary investigation of the track-tagging problem. This is an interesting open research problem and we will continue to improve the annotation accuracy.

This is the fourth blog of the Face Match tech blog series. You can browse the other three blogs in this series by visiting:

1. Face Match System – Overview

2. Face Match System – Face Detection

3. Face Match System – Shot Boundary Detection and Face Tracking