Motivation
We must confess: sometimes even we have a hard time recognizing actors in TV shows and movies. Sometimes the name is right on the tip of our tongues, and we still can't place it; it's even harder with some foreign actors. But if a video could provide detailed metadata about an actor whenever he or she pops up on screen, Hulu users would see that information right in the Hulu player, with the option to learn more about the actor they're interested in whenever they wanted.
From another point of view, general multimedia content analysis remains an unsolved problem, even with the significant progress made over the past 20 years. Unlike general content analysis, however, face-related technologies such as face detection, tracking, and recognition have recently matured into consumer products. The combination of these technological advances with our relentless pursuit of a better user experience at Hulu is where the idea of "Face Match" originated.
System design
When first examining the problem, one solution would be to have humans exhaustively annotate every face that appears in every frame of a video. However, this method would not scale to the billions of videos on the Internet. The other extreme would be to let an algorithm automatically detect and identify all the faces on its own. The bottleneck of this approach is that current recognition algorithms achieve only about 80% accuracy at best, which is far below the minimum user expectation. Taking both of these methods into account, it became apparent that the best solution would combine the merits of each: automate as much as possible while keeping human effort to a minimum.
Our system was designed to carefully balance computational complexity against human effort. As shown in Figure 1, the Face Match platform contains two main parts: the initial bootstrap-labeling stage and the auto-tag-cluster-confirm cycle. For each video, faces are detected and grouped into face tracks/groups. The details of these technologies are described in the next paragraphs.
To minimize the human effort required to label each individual face, visually similar face tracks are grouped via clustering. Thus, a human can select a number of face tracks at a time and label all of them in one fell swoop.
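As a rough illustration, the sketch below groups tracks by summarizing each one as the mean of its per-face descriptors and running off-the-shelf agglomerative clustering. The actual Face Match features and clustering algorithm are not specified here, so treat every choice (mean pooling, cosine distance, the threshold) as an assumption rather than our implementation.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def cluster_tracks(track_features, distance_threshold=0.5):
    """Group visually similar face tracks so a human can label a
    whole cluster in one action.

    track_features: list of (n_faces_i, d) arrays, one per track.
    Returns a cluster id per track.

    NOTE: hypothetical sketch -- mean pooling, cosine distance, and
    the threshold are all illustrative assumptions.
    """
    # Summarize each track by the mean of its per-face descriptors.
    summaries = np.stack([f.mean(axis=0) for f in track_features])
    # Average-linkage agglomerative clustering on cosine distance.
    z = linkage(summaries, method="average", metric="cosine")
    return fcluster(z, t=distance_threshold, criterion="distance")
```

An annotator would then be shown one cluster at a time and asked for a single name per cluster, rather than one name per track.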
For each show, the system first collects celebrity information from the web. Then, for the initial videos in each show, 20 percent of the face tracks are clustered and set aside for manual labeling. These bootstrapped celebrity labels provide the supervision for automatic track tagging. All face tracks could simply be clustered and labeled manually, but that would impose a heavy workload. To improve the efficiency of human annotation, we introduced the auto-tag-cluster-confirm cycle. With the bootstrap labels, the system learns predictive models for the celebrities; the models then tag the unlabeled tracks, which are passed to humans for confirmation. As the pool of celebrity labels grows with each iteration of the cycle, the system learns face models with better precision. On the front end, displaying a large number of a celebrity's face tracks one by one for manual confirmation would be inefficient, since a human still needs a few seconds to verify each face track. So, as in the initial annotation process, the system clusters visually similar face tracks together; a human can then confirm a number of tracks with one quick glance and a single click.
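The control flow of the cycle might look like the following sketch, where a k-NN classifier stands in for the per-celebrity models and confirm_fn stands in for the human confirmation UI. All names, thresholds, and the stopping rule here are illustrative assumptions, not Hulu's implementation.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def auto_tag_cycle(X_labeled, y_labeled, X_pool, confirm_fn,
                   min_conf=0.9, max_iters=10):
    """Iterate the auto-tag-cluster-confirm cycle (hypothetical sketch).

    X_labeled, y_labeled: bootstrap track features and celebrity labels
    X_pool:               features of still-unlabeled tracks
    confirm_fn(idx, pred) -> array of confirmed labels (-1 = rejected)
    """
    pool = np.arange(len(X_pool))
    for _ in range(max_iters):
        if len(pool) == 0:
            break
        # Re-train on the growing pool of labels each iteration.
        model = KNeighborsClassifier(n_neighbors=5).fit(X_labeled, y_labeled)
        probs = model.predict_proba(X_pool[pool])
        confident = probs.max(axis=1) >= min_conf
        if not confident.any():
            break  # nothing the model is sure about; fall back to manual labeling
        idx = pool[confident]
        pred = model.classes_[probs[confident].argmax(axis=1)]
        confirmed = confirm_fn(idx, pred)       # human verifies clustered tracks
        keep = confirmed != -1
        X_labeled = np.vstack([X_labeled, X_pool[idx[keep]]])
        y_labeled = np.concatenate([y_labeled, confirmed[keep]])
        pool = np.setdiff1d(pool, idx)          # shown tracks leave the pool
    return X_labeled, y_labeled, pool
```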
Figure 1. Overview of the system design. A.) Face groups/tracks are detected and extracted for each video; B.) For each show, celebrity information is collected and the initial pool of face tracks (20 percent) is clustered for bootstrap labels via user annotation; C.) Automatic face track tagging is introduced in the auto-tag-cluster-confirm cycle to minimize human effort.
To detect faces and connect them into tracks, we leverage face detection and tracking algorithms. We trained a multi-view face detector that covers 180 degrees of in-plane rotation and 180 degrees of yaw, using about ten thousand labeled examples. The detector needs roughly 300 milliseconds to process a 640x360 frame (running on a PC with a Xeon(R) E5420 CPU at 2.50GHz), so running it on every frame of a 30 fps video would take roughly nine times real time, which is unacceptably slow. Our tracking system rescues us from this heavy computation by associating isolated faces into continuous tracks. It can also extend face tracks to frames well beyond the initial detections, which improves overall system recall at moments when the detector misses existing faces, and it reduces the number of candidates for later tagging by a factor of about 100, since we only need to tag face tracks rather than isolated faces. To avoid the "drifting away" phenomenon in tracking, shot boundaries are detected and incorporated as well.
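To make the detection-plus-tracking idea concrete, here is a heavily simplified sketch: a stock OpenCV frontal-face Haar cascade substitutes for our multi-view detector, and greedy IoU matching between detections substitutes for the real tracker (which also extends tracks between detections and uses proper shot-boundary detection).

```python
import cv2

def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ix = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

def detect_and_track(video_path, detect_every=10, min_iou=0.3):
    """Associate sparse detections into face tracks (simplified sketch)."""
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    tracks, open_tracks = [], []  # each track is a list of (frame_idx, box)
    cap = cv2.VideoCapture(video_path)
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % detect_every == 0:
            # Retire tracks not extended recently: a crude stand-in for
            # shot-boundary handling that limits "drifting away".
            open_tracks = [t for t in open_tracks
                           if frame_idx - t[-1][0] <= detect_every]
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            for box in detector.detectMultiScale(gray, 1.1, 5):
                box = tuple(int(v) for v in box)
                # Greedily extend the best-overlapping open track,
                # or start a new one.
                best = max(open_tracks, key=lambda t: iou(t[-1][1], box),
                           default=None)
                if best is not None and iou(best[-1][1], box) >= min_iou:
                    best.append((frame_idx, box))
                else:
                    track = [(frame_idx, box)]
                    open_tracks.append(track)
                    tracks.append(track)
        frame_idx += 1
    cap.release()
    return tracks
```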
For automatic track tagging, we take advantage of multiple samples per track (each track contains many faces) and multi-view features (clothes, hair, facial, and contextual features). Figure 2 shows the pipeline of automatic track tagging: the system first builds databases from annotated tracks. Then, for each new face track, it extracts multi-view features for all samples in the track. For each face and each feature, it finds the nearest neighbors in the database via approximate nearest neighbor (ANN) search and decomposes the feature as a linear combination of those neighbors. Finally, the identity coefficient distributions for all faces and all features are aggregated to produce the final tag for the track. For details of the algorithm, please refer to the Track Tagging section.
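The following sketch illustrates the neighbor-decomposition idea for a single track and a single feature type: each face is expressed as a nonnegative linear combination of its nearest database neighbors, and the coefficient mass is accumulated per identity. The exact features, ANN index, and aggregation rule used in Face Match are not shown here; exact nearest-neighbor search and nonnegative least squares serve as stand-ins.

```python
import numpy as np
from scipy.optimize import nnls
from sklearn.neighbors import NearestNeighbors

def tag_track(track_feats, db_feats, db_labels, k=20):
    """Tag one face track against an annotated database
    (illustrative sketch of the neighbor-decomposition idea).

    track_feats: (n_faces, d) features for the track's faces
    db_feats:    (n_db, d) features of already-annotated faces
    db_labels:   (n_db,) celebrity id per database face
    """
    index = NearestNeighbors(n_neighbors=k).fit(db_feats)
    scores = {}
    for f in track_feats:
        _, nbrs = index.kneighbors(f[None, :])
        nbrs = nbrs[0]
        # Decompose the face as a nonnegative linear combination
        # of its k nearest neighbors.
        coeffs, _ = nnls(db_feats[nbrs].T, f)
        # Accumulate coefficient mass per identity across all faces.
        for j, c in zip(nbrs, coeffs):
            scores[db_labels[j]] = scores.get(db_labels[j], 0.0) + c
    return max(scores, key=scores.get)  # identity with the most support
```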

Figure 2. Algorithm pipeline for the track tagging method.

Processing pipeline
As shown in Figure 1a), when a video is about to be released, the system first determines the shot boundaries, densely samples frames, and applies the multi-view face detector to each sampled frame. The detected faces provide starting points for face tracking, and the tracking algorithms associate isolated faces into connected face groups, each belonging to a single person. With the tracks extracted, clustering algorithms group similar tracks for user annotation; alternatively, the track tagging stage automatically proposes actor candidates for each track for user confirmation. Finally, the face tracks are sent for human annotation or confirmation.
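Wiring the stages together, a per-video driver might look like the sketch below. It reuses the illustrative helpers from the earlier sketches, and the three callables passed in (extract_features, request_labels, request_confirmation) are hypothetical placeholders for the feature extractor and the human-facing annotation and confirmation tools.

```python
def process_video(video_path, db_feats, db_labels, extract_features,
                  request_labels, request_confirmation, bootstrap=False):
    """End-to-end sketch of the per-video pipeline (hypothetical).

    extract_features(track) -> (n_faces, d) array of multi-view features
    request_labels / request_confirmation wrap the human-facing UI.
    """
    tracks = detect_and_track(video_path)            # detection + tracking
    feats = [extract_features(t) for t in tracks]    # per-track features
    if bootstrap:
        # Early videos of a show: cluster tracks, ask humans for labels.
        clusters = cluster_tracks(feats)
        return request_labels(tracks, clusters)
    # Later videos: auto-tag each track; humans only confirm proposals.
    tags = [tag_track(f, db_feats, db_labels) for f in feats]
    return request_confirmation(tracks, tags)
```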
Combining these steps, we can automatically tag the tracks of a video in real time. For a typical TV show, 80 percent of the face tracks can be picked out and tagged automatically, with about 5 percent false positives; after that, human intervention is still required to verify the final results.
In the next three blog posts, we will introduce four core techniques: face detection, face tracking with shot boundary detection, face track clustering, and face track recognition. The annotation step is omitted since these posts cover only the technical algorithms.