Following the workflow of the Face Match system, this blog entry introduces the second core technique: face tracking with shot boundary detection.
Shot boundary detection
What is shot boundary detection?
A video is usually composed of hundreds of shots strung into a single file. A shot is a sequence of continuous frames captured in one camera action. Shot boundary detection locates the exact boundary between two adjacent shots. There are several kinds of boundaries between adjacent shots, but they generally fall into two types: abrupt transition (CUT) and gradual transition (GT). A CUT is usually easy to detect, since the change at the boundary is large. Considering the characteristics of different editing effects, GT can be further divided into dissolve, wipe, fade out/in (FOI), and so forth. In a GT, one shot transitions smoothly into the next, which makes it more difficult to determine the position of the boundary. Additionally, it can be hard to distinguish a GT from fast movement within a single shot, since the content varies smoothly in both cases.
Why is shot boundary detection needed?
Shot boundary detection is widely useful in video processing. It is a preliminary technique that can help us to divide a long and complex video into relatively short and simple segments.
In Face Match, shots are the basic units for face tracking: shot boundaries provide an effective constraint that keeps a face track from drifting across multiple shots.
How do you achieve shot boundary detection?
Three steps are required for shot boundary detection:
1. Extract features to represent the video content.
To find the shot boundary, the video is analyzed frame by frame. A raw color vector composed of the values of every pixel in a frame is not good enough to determine a shot change, since it is very sensitive to motion and illumination. Therefore, histogram features are extracted for both color, in the HSV color space, and texture, with the local binary pattern (LBP) descriptor. LBP reflects local geometric structure and is less sensitive to variations in global illumination.
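As a rough illustration, the sketch below extracts both histograms for a single frame with OpenCV and scikit-image; the bin counts and LBP settings are illustrative choices, not the exact Face Match configuration.

```python
# A minimal sketch of per-frame feature extraction, assuming OpenCV and
# scikit-image are available; bin counts and LBP settings are illustrative.
import cv2
import numpy as np
from skimage.feature import local_binary_pattern

def frame_features(frame_bgr):
    """Return (HSV color histogram, LBP texture histogram) for one frame."""
    # Color: a joint histogram in HSV space, normalized to sum to 1.
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    color_hist = cv2.calcHist([hsv], [0, 1, 2], None,
                              [16, 4, 4], [0, 180, 0, 256, 0, 256]).flatten()
    color_hist /= color_hist.sum() + 1e-10

    # Texture: uniform LBP codes on the gray image, pooled into a histogram.
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    lbp = local_binary_pattern(gray, P=8, R=1, method="uniform")
    lbp_hist, _ = np.histogram(lbp, bins=10, range=(0, 10))
    lbp_hist = lbp_hist.astype(np.float64)
    lbp_hist /= lbp_hist.sum() + 1e-10
    return color_hist, lbp_hist
```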
2. Compute the measurement of continuity.
Continuity measures the similarity between adjacent frames. On a shot boundary, the continuity should have a low value. Using this measurement, the content of a video can be transformed into a one-dimensional temporal signal. If the measurement involves only two adjacent frames, GT is hard to detect, since the variation between two adjacent frames is small. Thus, a larger time window is used, in which K frames lie along the time axis. See Figure 1 below for an example. All pairwise similarities among these K frames can be computed, yielding a graph over the K frames with K*(K-1) edges weighted by similarity, as demonstrated below. We've adopted histogram intersection as the similarity measure, weighted by the temporal distance between the two frames of each pair.
[Figure 1. The graph with K*(K-1) edges (only part of the edges are shown) and the K*K weight matrix]
The normalized cut CN of this graph is calculated as the continuity value of the middle frame in this window:

CN = cut(A, B) / assoc(A, V) + cut(A, B) / assoc(B, V)

where A and B are the first and second halves of the window, V is the set of all K frames, cut(A, B) is the total weight of the edges crossing from A to B, and assoc(A, V) is the total weight of the edges from frames in A to all frames in V.
Since color and LBP histograms are both employed, two continuity curves are obtained. They are combined by multiplying them together point-wise.
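The sketch below illustrates this windowed measure, continuing from the feature snippet above: it builds the K*K similarity matrix with histogram intersection and evaluates the normalized cut between the two halves of the window. The Gaussian distance weighting and the window size are assumptions for illustration.

```python
# A minimal sketch of the windowed continuity measure, assuming a list of
# per-frame histograms; distance weighting and K are illustrative choices.
import numpy as np

def intersection(h1, h2):
    """Histogram intersection similarity of two normalized histograms."""
    return np.minimum(h1, h2).sum()

def continuity(hists, K=8, sigma=2.0):
    """Normalized-cut continuity for the middle frame of each K-frame window."""
    n = len(hists)
    curve = np.ones(n)
    half = K // 2
    for t in range(n - K + 1):
        # All-pair similarity matrix, weighted by temporal distance |i - j|.
        W = np.zeros((K, K))
        for i in range(K):
            for j in range(K):
                if i != j:
                    w = np.exp(-((i - j) ** 2) / (2 * sigma ** 2))
                    W[i, j] = w * intersection(hists[t + i], hists[t + j])
        # Normalized cut between the first half (A) and second half (B).
        cut = W[:half, half:].sum()          # total weight crossing A -> B
        assoc_a = W[:half, :].sum()          # weight from A to all frames
        assoc_b = W[half:, :].sum()          # weight from B to all frames
        curve[t + half] = cut / (assoc_a + 1e-10) + cut / (assoc_b + 1e-10)
    return curve
```

Running this once on the color histograms and once on the LBP histograms, then multiplying the two curves, yields the combined continuity signal.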
3. Decide the position (and type) of the shot boundary.
There are two approaches to determine the shot boundary. The first uses a pre-defined threshold to classify the curve into two categories. The second relies on machine-learning techniques to train a classifier. As we lack enough training data, we selected the first approach.
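As a rough sketch of the first approach, assuming the combined curve from the snippet above, a boundary can be declared at each local minimum that falls below a pre-defined threshold; the threshold value here is illustrative and would be tuned on real data.

```python
# A minimal sketch of threshold-based boundary decision; the threshold
# value is an illustrative assumption, tuned per video corpus in practice.
def detect_boundaries(curve, threshold=0.3):
    """Mark a boundary at each local minimum of the curve below the threshold."""
    boundaries = []
    for t in range(1, len(curve) - 1):
        if (curve[t] < threshold
                and curve[t] <= curve[t - 1]
                and curve[t] <= curve[t + 1]):
            boundaries.append(t)
    return boundaries

# Combined curve: the color and LBP continuity curves multiplied point-wise.
# cuts = detect_boundaries(continuity(color_hists) * continuity(lbp_hists))
```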
Face tracking
What is face tracking?
Face tracking follows a human face through a video or a continuous image sequence, starting from an initial state (with parameters such as position, scale, rotation, expression, etc.) provided by face detection and, possibly, face alignment techniques (Figure 2).
Face tracking may be implemented online or offline. In online mode, a face is tracked while the video is being captured. Thus, only the current and previous frames can be exploited for tracking, and the efficiency requirements are strict. In offline mode, the whole video file is available beforehand, so information from any frame can be used to guide the tracking.
In Face Match, since the video has been obtained beforehand, we implement tracking in offline mode, and only the position and scale of the face are tracked.
[Figure 2. Illustration of face tracking]
Why is face tracking needed?
Video is generally composed of tens of thousands of frames. To find as many faces as possible in each frame, one option is to perform face detection frame by frame. Given that detection takes about 0.3 seconds for a 640x360 frame, while a frame lasts only about 0.03 to 0.04 seconds at typical playback rates, processing a video this way is more than eight times slower than playback. Thus, it is not feasible in practice.
Considering the continuity of video along the time axis and the redundancy between adjacent frames, face tracking can be employed instead of per-frame face detection. Since face tracking is very efficient, the time cost can be significantly reduced. Moreover, the faces of the same person in consecutive frames are linked together, so for each face track only representative face samples are needed in the subsequent face clustering or tagging steps, which further decreases processing time. Face tracking can also help recover faces that are difficult to detect.
How do you achieve face tracking?
There are several mature standard models designed for object tracking, such as optical flow, mean shift, and particle filters. Considering the efficiency required to process thousands of videos, we adopted an optical-flow based tracker, following the Kanade–Lucas–Tomasi (KLT) tracker, which is based on object appearance and nonlinear least-squares optimization. If the appearance of the object changes only slightly over time, tracking performance is very good. The approach can also handle many motion parameters beyond translation and scale, such as 3D rotation angles and expression parameters (e.g., in active appearance models). By adopting inverse compositional techniques, solving the optical-flow problem becomes very efficient.
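As a rough sketch of this style of tracker, OpenCV's pyramidal Lucas–Kanade implementation tracks feature points between two frames; the window size and pyramid depth below are illustrative defaults rather than the Face Match settings.

```python
# A minimal sketch of KLT-style point tracking between two frames using
# OpenCV's pyramidal Lucas-Kanade; parameters are illustrative defaults.
import cv2
import numpy as np

def track_points(prev_gray, next_gray, points):
    """Track feature points from prev_gray to next_gray; return matched pairs."""
    next_pts, status, _err = cv2.calcOpticalFlowPyrLK(
        prev_gray, next_gray, points, None,
        winSize=(21, 21), maxLevel=3)  # 3 pyramid levels for large motion
    ok = status.ravel() == 1
    return points[ok], next_pts[ok]

# Seed points inside a detected face box (x, y, w, h), e.g. with corners:
# pts = cv2.goodFeaturesToTrack(prev_gray[y:y+h, x:x+w], 50, 0.01, 5)
# pts = (pts + np.float32([x, y]))  # shift corners back to image coordinates
```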
An optical-flow based tracker exploits the continuity of adjacent frames under three assumptions:
The appearance of the target object is similar or identical in adjacent frames.
The target object has abundant texture.
The variation of the pose parameters (translation, scaling, rotation) is small.
For face tracking in a video stream, these three assumptions are usually satisfied.
Given a face box in the first frame, optical flow minimizes the appearance difference between the face areas in adjacent frames to find the best face box in the next frame. In our application, the parameters describing a face box are translation and scale, and they are obtained iteratively by solving a nonlinear least-squares problem. Some further considerations are:
To alleviate sensitivity to illumination, we use the normalized intensity of gradients as the appearance descriptor, since it is also simple to compute. The original gradient intensity is normalized by a sigmoid function to limit its dynamic range to [0, 1], using a mapping of the form:

g' = 2 / (1 + exp(-g)) - 1,  where g >= 0 is the gradient intensity

To cover large displacements of the face both in and out of the image plane, a multi-resolution strategy with a pyramid structure is employed.
A two-step tracking strategy is used: 1) track only the translation of the face area using the pyramid structure; 2) track translation and scale simultaneously at a single resolution.
To keep the track from drifting into the background, an online learning model is adopted in the second step: each pixel of the face appearance is modeled as a Gaussian distribution whose mean and variance are updated during tracking. If the tracking error exceeds a pre-defined threshold, the track is terminated.
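The sketch below illustrates the two-step idea on top of point matches such as those from the Lucas–Kanade snippet above: translation is estimated first, then scale, and a per-pixel Gaussian test stands in for the online appearance model. The median-based estimates are simplifications of the actual least-squares solution.

```python
# A hedged sketch of the two-step update: step 1 estimates translation only,
# step 2 refines scale; the median estimators are illustrative substitutes
# for the nonlinear least-squares solver described above.
import numpy as np

def update_box(box, p0, p1):
    """Move and rescale a face box (x, y, w, h) from point matches p0 -> p1."""
    p0 = np.asarray(p0, dtype=np.float64).reshape(-1, 2)
    p1 = np.asarray(p1, dtype=np.float64).reshape(-1, 2)
    # Step 1: translation as the median point displacement (robust to outliers).
    shift = np.median(p1 - p0, axis=0)
    # Step 2: scale as the median ratio of point spreads around the centroid.
    d0 = np.linalg.norm(p0 - p0.mean(axis=0), axis=1)
    d1 = np.linalg.norm(p1 - p1.mean(axis=0), axis=1)
    scale = np.median(d1 / (d0 + 1e-10))
    x, y, w, h = box
    cx = x + w / 2 + shift[0]
    cy = y + h / 2 + shift[1]
    w, h = w * scale, h * scale
    return (cx - w / 2, cy - h / 2, w, h)

def drift_error(mean_model, var_model, patch):
    """Mahalanobis-style error of the current patch under per-pixel Gaussians."""
    # Terminate the track when this exceeds a pre-defined threshold.
    return np.mean((patch - mean_model) ** 2 / (var_model + 1e-10))
```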
Face detection and shot boundary detection serve as preprocessing: face detection provides starting points for face tracking, and shot boundary detection confines each face track to a single shot. Before tracking, each shot therefore contains several detected face boxes in different frames. We iteratively associate these detections into longer tracks and extend the connected tracks with further tracking, which completes the tracking step.
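As a rough sketch of the association step, the following greedy linker attaches each detection to the existing track whose latest box overlaps it most; the IoU criterion and threshold are illustrative assumptions, not the exact Face Match rule.

```python
# A minimal sketch of associating detected face boxes within one shot into
# tracks by overlap; the IoU threshold and greedy strategy are assumptions.
def iou(a, b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def associate(detections, tracks, min_iou=0.5):
    """Greedily attach each detection to the track whose last box overlaps most.

    detections: list of (frame_idx, box); tracks: list of [(frame_idx, box), ...]
    """
    for frame_idx, box in detections:
        best = max(tracks, key=lambda t: iou(t[-1][1], box), default=None)
        if best is not None and iou(best[-1][1], box) >= min_iou:
            best.append((frame_idx, box))      # extend an existing track
        else:
            tracks.append([(frame_idx, box)])  # start a new track
    return tracks
```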