Hulu Tech Blog

Face Match System – Clustering, Recognition and Summary

May 4th, 2014 by Cailiang Liu

Following the workflow of the Face Match system, this blog entry introduces the third core technique: face track clustering and recognition.

Track tagging

When a series of face tracks have been extracted from a set of videos, the next step is to tag them automatically with some probable actor names from the given show. After all, manually processing all the tracks from scratch would be infeasible. The tags, with some sort of acceptable accuracy rate — let’s say 80 percent — provide valuable cues for a person to verify the tracks in groups. When presented in a user-friendly interface, the tags also improve the speed required to correct erroneous matches. Given that, we are seeking ways to improve tagging accuracy for face tracks. This naturally falls into the machine-learning framework, which is widely adopted in the computer vision research community. In this blog entry, we refer to this problem of automatically annotating faces (not tracks) as face tagging.

Traditionally, face verification technology tries to identify whether a given image belongs to a specific person in a set of candidates. Though successfully applied in controlled environments, the approach has strict assumptions: the video must have good lighting, the actors must be facing the camera, and their faces cannot be obscured. However, these assumptions do not hold in challenging environments such as TV shows or movies. General research interest has recently turned toward uncontrolled datasets such as “Labeled Faces in the Wild” (LFW), and the benchmark of identifying whether two faces belong to the same person has attracted a lot of attention. However, the LFW database contains many people with only two face samples each, so the benchmark can hardly cover the case of identifying many people across many poses in truly unconstrained environments.

In the machine-learning framework, the problem of track tagging essentially boils down to how to construct a proper track similarity function as a function of the similarity of the faces in the tracks. Because we are facing the largest dataset for face verification in research history, the time and labor for human verification have become the most critical metrics. Only by improving the accuracy of track tagging can we significantly reduce the time and labor for verification. There are a few aspects that impact the results: 1) The features of the toolset; 2) The learning approach; 3) The cold start problem. Still, because of the very large dataset, we are also constrained by the amount of processing time we can afford. Given the potential number of all videos available to Hulu, we need to reduce the processing time to less than one second per face image. Thus, we cannot afford the recent effective, yet heavy, methods such as dense local feature based methods. Next, we will elaborate on each of these aspects.

Features extraction

In the current system, we leverage quite a few kinds of visual information to improve the tagging accuracy. Compared with a single image, we are equipped with the temporal information provided by continuous face tracks. Fusing these tracks into a 3-D face model is an interesting alternative for us to explore in the future. For now, we’ve limited ourselves to selecting a few representative faces and have constructed the track similarity function as a function of the similarity of the representative faces.

First, we resize the image to ensure the face region is 80×80 pixels. Then we enlarge the selected region to 80×160 pixels by extending it 40 pixels upward and 40 pixels downward. See Figure 1 for an example.

Standard face features such as global face features and LBP (local binary pattern) facial features are extracted on the global region and the local regions respectively. The global face feature is extracted on the aligned face with a 16×16 grid layout, with each grid cell containing a 58-dim LBP feature. The LBP facial features are extracted on each local facial window with a 4×4 grid layout, and a 58-dim histogram of LBP codes is accumulated for each cell. These local histograms are concatenated into a 928-dim vector.
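For illustration only (this is not the production code), the grid-based LBP histograms described above can be sketched with scikit-image; the grid sizes, the 59-bin 'nri_uniform' variant, and the per-cell normalization are assumptions:

# Sketch of grid-based LBP histogram extraction (assumed parameters, not the production code).
import numpy as np
from skimage.feature import local_binary_pattern

def grid_lbp_histograms(gray_face, grid_rows, grid_cols, points=8, radius=1):
    # 'nri_uniform' yields 59 codes: the 58 non-rotation-invariant uniform patterns
    # plus one catch-all bin for the non-uniform patterns.
    codes = local_binary_pattern(gray_face, points, radius, method='nri_uniform')
    n_bins = 59
    h, w = codes.shape
    cell_h, cell_w = h // grid_rows, w // grid_cols
    hists = []
    for r in range(grid_rows):
        for c in range(grid_cols):
            cell = codes[r * cell_h:(r + 1) * cell_h, c * cell_w:(c + 1) * cell_w]
            hist, _ = np.histogram(cell, bins=n_bins, range=(0, n_bins))
            hists.append(hist / max(hist.sum(), 1))   # normalize each cell histogram
    return np.concatenate(hists)

# e.g. the local facial windows above: a 4x4 grid over an aligned face region
# feature = grid_lbp_histograms(face_window, grid_rows=4, grid_cols=4)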

A few face verification approaches require face alignment and face warping as a preprocessing step. The alignment process identifies landmark points in the face, e.g. the corners of the eyes, the mouth and the nose. The face can then be warped to the frontal position by triangulating the facial landmarks and finding the affine mapping, so global face features can be extracted even from distorted faces. However, in our experiments, we did not see much improvement from this step. This may be due to the fragility of the alignment algorithm we used.

We assume that the given character’s appearance will not change often in one video. So we further incorporate a few other features to reflect the character’s appearance, including hair and face, as well as the environment in which he or she appears. More specifically, we extract texture and color features in respective areas of the face image to reflect hair and scenery. The LBP feature is also extracted on the full 80×160 region to represent the face as a whole. The importance weights among different modalities are learned afterward with some label information for face tracks.


Figure 1. Feature extraction for face tracks

Learning approach

The primary goal in this step is to construct a proper track similarity function as a function of the similarity of the underlying faces across the tracks.

Given a new video, the tracks of an actor will usually be more similar to tracks of the same actor in this specific video than to tracks of the same actor from other videos. This is because the appearance of the actor remains mostly the same within a given video. Thus, label information from the current video is more valuable than that from other videos. With these labels, we can expect higher tagging accuracy, so we adopt an online learning scheme to incorporate newly verified track labels from the current video as early as possible.

As we need to handle several tens of thousands of actors in our system, building and maintaining a supervised model for every possible actor is infeasible, even though we only need to deal with 100 to 500 actors for a given show. Given online learning and a huge number of candidates, we adopt a k-Nearest Neighbor (kNN) based lazy-learning approach: we annotate the faces independently and then vote among the face tags to determine the tag for the given track. The merit of such lazy learning is that we do not need to maintain any learned model, and newly acquired labels can be added instantly. As shown in Figure 2, after feature extraction, an approximate kNN scheme is used to speed up the neighbor-finding process. For a face track X, the j-th feature of the i-th face in X is denoted X_ij, and its nearest samples are denoted S_1, S_2, …, where S_1 is the most similar neighbor, S_2 the second most similar, and so on. Each face is represented by a linear combination of its nearest neighbors, and the weight for each neighbor is the similarity of the target face to that neighbor. The L2-norm is used in the similarity metric because the L1-norm performs worse and is far less efficient:

[Equation: neighbor weights b_ij computed from the L2 similarities between X_ij and its nearest neighbors]

We treat faces with different poses in the same way, since the database is large enough that faces will find neighbors with the same pose. With the coefficients b_ij, we can generate a voting distribution a_ij over the identity list:

[Equation: voting distribution a_ij over identities, aggregated from the neighbor weights b_ij]

To measure the reliability of the voting, we use the sparse concentration index as the confidence score:

[Equation: sparse concentration index c_ij used as the confidence score for a_ij]
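As a rough sketch of the per-face voting and confidence described above (exact kNN stands in for the approximate scheme; the Gaussian similarity kernel and the exact form of the sparse concentration index are assumptions):

# Sketch: kNN-based identity voting for one face feature, with an SCI-style confidence.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def face_vote(query_feat, db_feats, db_labels, k=20):
    nn = NearestNeighbors(n_neighbors=k).fit(db_feats)
    dists, idx = nn.kneighbors(query_feat.reshape(1, -1))
    # Weight each neighbor by an L2-distance based similarity (assumed Gaussian kernel).
    sims = np.exp(-dists[0] ** 2 / (2 * np.median(dists[0] + 1e-8) ** 2))
    identities = np.unique(db_labels)
    votes = np.zeros(len(identities))
    for s, i in zip(sims, idx[0]):
        votes[np.searchsorted(identities, db_labels[i])] += s
    votes /= max(votes.sum(), 1e-8)
    # Sparse-concentration-style confidence: 1 when one identity takes all the weight,
    # 0 when the votes are spread uniformly over the candidates.
    c = len(identities)
    sci = (c * votes.max() - 1) / (c - 1) if c > 1 else 1.0
    return identities, votes, sci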

In order to fuse the per-face distributions a_ij into a label for the samples X_ij, we apply a weighting function of the confidence score c_ij, in which a squared-confidence term magnifies votes with large confidence scores and the parameters v_jk are learned. In effect, when the confidence score is not high, the vote receives a lower weight.

Learning voting weights for features with structured output SVM

The standard structured output SVM primal formulation is given as follows:

\min_{w, \xi} \; \frac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i
\quad \text{s.t.} \quad \langle w, \Psi(X_i, y_i) \rangle - \langle w, \Psi(X_i, y) \rangle \ge \Delta(y_i, y) - \xi_i \quad \forall y \ne y_i, \qquad \xi_i \ge 0

The voting weight w is the stack of the vectors v_j. To learn w, we define the joint feature map \Psi(X, y) = \Phi(X)\, e_y, where e_y is an indicator vector with a 1 only in its y-th row (selecting the features for class y) and \Phi(X) maps a track to a matrix of confidences for the different identities:

[Equation: definition of \Phi(X), the matrix of per-feature identity confidences for track X]

Learning a structured output SVM with the joint feature map defined above results in weight vectors that best combine the multi-view features for face track recognition. To vote on the identity label for a track X, we use the following formula:

\hat{y} = \arg\max_{y} \; \langle w, \Psi(X, y) \rangle
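As a small numpy illustration of this voting step (the weight vector and the confidence matrix below are placeholders, not learned values):

# Combine per-feature identity confidences with the learned feature weights and
# pick the highest-scoring identity for the track.
import numpy as np

def vote_track_identity(conf_matrix, w):
    # conf_matrix: (n_features, n_identities) confidences for one track.
    # w: (n_features,) weights learned by the structured output SVM.
    scores = w @ conf_matrix
    return int(np.argmax(scores)), scores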

Fusing samples for the track label

One simple way to fuse the different samples X_i is to use the identity distributions of all samples when computing the track-level score. However, this produces mismatches, because many samples are very similar and they may all match to faces with the wrong identity. To avoid these mismatches, we adopt a diversity-sampling algorithm, GRASSHOPPER, to select diverse samples. We define the similarity function used by GRASSHOPPER as:

[Equation: pairwise similarity function between samples used by GRASSHOPPER]

where the terms in the similarity are the most similar neighbors of the corresponding samples.
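GRASSHOPPER itself ranks items with an absorbing random walk; the sketch below is a simpler greedy max-min stand-in that captures the same goal of picking mutually dissimilar samples:

# Greedy max-min diversity sampling, a simple stand-in for GRASSHOPPER.
import numpy as np

def diverse_samples(feats, n_select):
    # feats: (n_samples, n_dims) feature matrix for the faces of one track.
    chosen = [0]   # start from an arbitrary sample, e.g. the first face
    while len(chosen) < min(n_select, len(feats)):
        d = np.min(
            np.linalg.norm(feats[:, None, :] - feats[chosen][None, :, :], axis=-1),
            axis=1,
        )
        d[chosen] = -np.inf           # never re-pick an already chosen sample
        chosen.append(int(np.argmax(d)))
    return chosen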

Finally the label of the face track X is computed using the formula:

[Equation: final track label computed from the diverse samples' weighted votes]

Experiments show that, with sufficiently large face databases, the precision of automatic track tagging would be as high as 95 percent when annotating 80 percent of the face tracks. For some high-quality episodes, the system is able to annotate 90 percent of face tracks with 95 percent accuracy. This significantly reduces the time required for manual confirmation.

After automatic tagging, the face tracks are clustered with respect to visual similarity and presented to human annotators for verification. The corrected labels are fed back into the system to further improve the tagging accuracy.

Cold start

The cold start phenomenon is frequently discussed in the research for recommendation systems. Due to lack of information for a newcomer to the system, no cue is available for deciding which items to recommend. Similarly, when a new show or a new actor comes into our system, we have no labeled information, and thus supervised learning is not feasible. In such a situation, we resort to unsupervised/semi-supervised learning approaches to provide the initial labels for a few tracks to the system.

Simple unsupervised hierarchical clustering is possible, but we can do better than that. Though we do not have label information for a new show or a new actor, we do have labels for other actors in other shows. Thus, with a few pre-built classifiers for each of the known actors, we construct a similarity vector to measure the similarities of the current track to the given set of known actors. See Figure 2 for details, where the small graph illustrates an example of one track’s classification scores against a list of known actors. Arguably, this similarity vector encodes some prior knowledge in the system, so we expect this semi-supervised learning scheme to outperform the unsupervised scheme. Experimental results show that the semi-supervised scheme improves the cluster purity score by 30 percent over the unsupervised scheme.
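A sketch of this semi-supervised idea: score each new track against a bank of pre-built per-actor classifiers, use the score vector as the track representation, and cluster those vectors (the classifier bank, the use of decision_function, and the cluster count are assumptions):

# Represent each track of a new show by its similarity to known actors, then cluster.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_new_show_tracks(track_feats, known_actor_models, n_clusters=20):
    # Each column holds one known actor's classifier scores for every track.
    sim_vectors = np.column_stack(
        [m.decision_function(track_feats) for m in known_actor_models]
    )
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(sim_vectors)
    return sim_vectors, labels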


Figure 2. Computing track similarities (with respect to known actors) for face track clustering

Lessons learned

  • Combining face features and context features for hair and clothes improves annotation accuracy.
  • The online active learning scheme shows better results than offline ones.
  • Confirmation is an easier and faster task than annotation for humans. More accurate prediction results help a lot in reducing confirmation time.
  • Grouping visually similar tracks together for confirmation lightens manual workload and significantly reduces human reaction time.
  • The semi-supervised scheme helps solve the cold start problem, and therefore helps annotation.

Our exploration is a preliminary investigation of the track-tagging problem. This is an interesting open research problem and we will continue to improve the annotation accuracy.

This is the 4th blog in the Face Match tech blog series. You can browse the other three blogs in this series by visiting:

1. Face Match System – Overview

2. Face Match System – Face Detection

3. Face Match System – Shot Boundary Detection and Face Tracking

Face Match System – Shot Boundary Detection and Face Tracking

May 3rd, 2014 by Tao Xiong

Following the workflow of the Face Match system, this blog entry introduces the second core technique: face tracking with shot boundary detection.

Shot boundary detection

What is shot boundary detection?

A video is usually composed of hundreds of shots strung into a single file. A shot is composed of continuous frames captured in one camera action. Shot boundary detection is used to locate the precise boundary between two adjacent shots. There are several kinds of boundaries between adjacent shots, but they can generally be categorized into two types: abrupt transition (CUT) and gradual transition (GT). A CUT is usually easy to detect since the change at the boundary is large. Considering the characteristics of different editing effects, GT can be further divided into dissolve, wipe, fade out/in (FOI), and so forth. In a GT, there is a smooth transition from one shot to another, which makes it more difficult to determine the position of the boundary. Additionally, it can be difficult to tell the difference between a GT and fast movement within a single shot, since the variation of content is smooth in both cases.

Why is shot boundary detection needed?

Shot boundary detection is widely useful in video processing. It is a preliminary technique that can help us to divide a long and complex video into relatively short and simple segments.

In Face Match, shots are the basic units for face tracking as they provide an effective tool to restrict a face track that may drift across multiple shots.

How do you achieve shot boundary detection?

Three steps are required for shot boundary detection:

1. Extract features to represent the video content.

To find the shot boundary, the video is analyzed frame-by-frame. The color vector composed of color values of all pixels in a frame is not good enough to determine a shot change since it’s very sensitive to movement and illumination. Therefore, histogram features for both colors in the HSV color space and textures with the local binary pattern (LBP) descriptor are extracted. LBP reflects a local geometric structure and is less sensitive to variations in global illumination.

2. Compute the measurement of continuity.

Continuity measures the similarity between adjacent frames; on a shot boundary, the continuity should have a low value. Using this measurement, the content of a video can be transformed into a one-dimensional temporal signal. If the measurement is only associated with two adjacent frames, it is hard to detect a GT, since the variation between two adjacent frames is small. Thus, a larger time window is used, in which K frames lie along the time axis (see Figure 1 below for an example). Their all-pairs similarities can be computed, and a graph can be constructed over these K frames with K*(K-1) edges weighted by the similarities, as demonstrated below. We’ve adopted histogram intersection as the similarity measure, weighted by the distance between the two frames in a pair.


Figure 1. The graph with K*(K-1) edges (only part of edges are shown) and the K*K weight matrix

The normalized cut C_N of this graph is calculated as the continuity value of the middle frame in this window, where

C_N = \frac{\mathrm{cut}(A, B)}{\mathrm{assoc}(A, V)} + \frac{\mathrm{cut}(B, A)}{\mathrm{assoc}(B, V)}

with A and B the frames before and after the middle of the window and V the full set of K frames.

Since color and LBP histograms are both employed, two curves can be obtained. The results are combined by multiplying them together.
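A rough sketch of the continuity computation (OpenCV for the HSV histogram, histogram intersection for the similarity, and a simple two-way normalized-cut score at the middle of the window; the bin counts and the temporal weighting are assumptions):

# Window-based continuity score for shot boundary detection (assumed parameters).
import cv2
import numpy as np

def hsv_hist(frame_bgr, bins=(16, 4, 4)):
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, list(bins), [0, 180, 0, 256, 0, 256])
    return (hist / max(hist.sum(), 1e-8)).ravel()

def continuity(window_hists):
    # Normalized-cut-style continuity for the middle frame of a K-frame window;
    # low values indicate a likely shot boundary at the middle frame.
    k = len(window_hists)
    sim = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            # Histogram intersection, down-weighted by the temporal distance.
            sim[i, j] = np.minimum(window_hists[i], window_hists[j]).sum() / (1 + abs(i - j))
    mid = k // 2
    a, b = np.arange(k) < mid, np.arange(k) >= mid
    cut = sim[np.ix_(a, b)].sum()
    return cut / max(sim[a].sum(), 1e-8) + cut / max(sim[b].sum(), 1e-8)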

3. Decide the position (and type) of the shot boundary.

There are two approaches to determine the shot boundary. The first uses a pre-defined threshold to classify the curve into two categories. The second relies on machine-learning techniques to train a classifier. As we lack enough training data, we selected the first approach.

Face Tracking

What is face tracking?

Face tracking is the tracking of the human face in a video or a continuous image sequence from a start point (with parameters such as position, scale, rotation, expression, etc.) given by face detection and even face alignment techniques (Figure 2).

Face tracking may be implemented online or offline. In online mode, a face is tracked while the video is being captured. Thus, only current and previous frames can be used to exploit information for tracking and the efficiency requirement is strict. In offline mode, the whole video file is generated ahead of time. Therefore, the information of any frame can be used to guide the tracking.

In Face Match, since the video has been obtained beforehand, we implement tracking in offline mode, and we are only concerned with the position and scale of the face.


Figure 2. Illustration of face tracking

Why is face tracking needed?

Video is generally composed of tens of thousands of frames. To find as many faces as possible in each frame, one option is to perform face detection frame-by-frame. Given that it takes 0.3 seconds for a frame sized 640×360, processing a video is more than eight times slower than video playback. Thus, it is not feasible in practice.

Considering the continuity of video along the time axis and the redundancy between adjacent frames, face tracking can be employed instead of face detection in each frame. Since face tracking is very efficient, the time cost can be significantly reduced. Moreover, the faces of the same person in consecutive frames can be linked together. Thus, for each face track, only representative face samples are needed in subsequent face clustering or tagging steps, which can dramatically decrease processing time. Moreover, face tracking can help recover more difficult-to-detect faces.

How do you achieve face tracking?

There are several mature standard models for object tracking, such as optical flow, mean shift and particle filters. Considering the efficiency required to process thousands of videos, we adopted an optical-flow based tracker. Specifically, we follow the Kanade–Lucas–Tomasi tracker, which is based on object appearance and nonlinear least-squares optimization. If the appearance of the object changes only slightly over time, the tracking performance is very good. The framework can also handle motion parameters beyond translation and scale, such as 3D rotation angles and expression parameters (e.g., active appearance models). By adopting inverse compositional techniques, the optical flow can be solved very efficiently.

An optical-flow based tracker makes use of the continuity of adjacent frames under three assumptions:

  1. The appearance of the target object is similar or the same in adjacent frames
  2. The target object has abundant texture
  3. The variation of the pose parameters (translation, scaling, rotation) is small

For face tracking in a video stream, these three assumptions are usually satisfied.

Given a face box in the first frame, optical flow minimizes the appearance difference between the face areas in adjacent frames to find the best face box in the next frame. The parameters describing a face box in our application are translation and scale, and they are obtained iteratively by solving a non-linear least-squares problem. Some further considerations are:

  • To alleviate sensitivity to illumination, we use the normalized intensity of gradients as the appearance descriptor, since it is also simple to compute. The original intensity of gradients is normalized by a sigmoid function to limit its dynamic range to [0, 1]:

[Equation: sigmoid normalization of the gradient intensity to the range [0, 1]]

  • To cover large displacement of face both in and out of the image plane, a multi-resolution strategy with pyramid structure is employed.
  • Two-step tracking strategy is proposed: 1) track only the translation of the face area using pyramid structure; 2) track translation and scale synchronously in single resolution.
  • To prevent the track from drifting into the background, an online learning model is adopted in the second step above. Each pixel of the face-area appearance is modeled as a Gaussian distribution whose mean and variance are updated during tracking. If the tracking error is greater than a pre-defined threshold, the tracking is terminated.

The preprocessing consists of face detection and shot boundary detection: face detection provides starting points for face tracking, and shot boundary detection restricts each face track to a single shot. Thus, before tracking, each shot contains several detected face boxes in different frames. We iteratively associate the detected faces into longer tracks and extend the connected tracks with further tracking; this completes the tracking step.
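As a rough OpenCV sketch of the optical-flow step (pyramidal Lucas-Kanade on points inside the face box, with only translation estimated; the scale estimation and the online Gaussian appearance model described above are omitted):

# Pyramidal Lucas-Kanade tracking of a face box between two grayscale frames.
import cv2
import numpy as np

def track_face_box(prev_gray, next_gray, box):
    x, y, w, h = box
    roi = prev_gray[y:y + h, x:x + w]
    pts = cv2.goodFeaturesToTrack(roi, maxCorners=50, qualityLevel=0.01, minDistance=3)
    if pts is None:
        return None
    pts = pts.reshape(-1, 2).astype(np.float32) + np.array([x, y], dtype=np.float32)
    new_pts, status, _ = cv2.calcOpticalFlowPyrLK(
        prev_gray, next_gray, pts.reshape(-1, 1, 2), None,
        winSize=(15, 15), maxLevel=3)     # pyramid levels give the multi-resolution strategy
    new_pts = new_pts.reshape(-1, 2)
    good = status.ravel() == 1
    if good.sum() < 5:
        return None                       # too few surviving points: terminate the track
    dx, dy = np.median(new_pts[good] - pts[good], axis=0)
    return (int(round(x + dx)), int(round(y + dy)), w, h)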

This is the 3rd blog in the Face Match tech blog series. You can browse the other three blogs in this series by visiting:

1. Face Match System – Overview

2. Face Match System – Face Detection

4. Face Match System – Clustering, Recognition and Summary

Face Match System – Face Detection

May 3rd, 2014 by Tao Xiong

Following the workflow of the Face Match system, this blog entry introduces the first core technique: face detection.

Face detection

How does the system identify which faces to detect?

Face detection is an essential step in face tagging. The detection rate strongly correlates with the final system recall of faces. We take careful steps to detect frontal faces as well as profile faces because the latter are indispensable for recalling whole-profile face tracks, which are abundant in premium videos. The detection of rotated faces is also a necessity. See Figure 1 below for an illustration of face poses we strive to detect with our detector, where yaw refers to different profile degrees ranging from -90 to 90 degrees, and rotation refers to in-plane rotation from -90 to 90 degrees. We do not run the full range of in-plane rotation due to efficiency considerations.


Figure 1. Out-of-plane and in-plane rotations of the human face

Incorporating such variation in the detector complicates the architecture design. We need to carefully design the algorithm and parameters to balance accuracy, false detection rate and running speed. Keep in mind that the detector is the most time-consuming component in the whole system.

Building a multi-view face detector

Face detection is a well-studied problem with a long research tradition. The state-of-the-art detector follows the sliding window approach to exhaustively scan all possible sub-windows in one image, and it relies on cascade-boosting architecture to quickly filter out negative examples. See Figure 2 for an illustration of the cascaded classifiers. Each stage (denoted as 1, 2, 3, etc.) is a classifier, which scores the sub-windows. The windows with scores below a certain threshold are discarded and only those with larger scores are passed. Thus with carefully designed classifiers, we can safely filter a certain portion of negative examples without falsely rejecting many truly positive examples. Though the number of sub-windows is huge for an image, most of the windows are negative examples and will run through only one or two stages. Thus the process is quite efficient for a single-face pose.


Figure 2. Cascade classifiers
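The production detector is in-house; purely to illustrate the cascade idea, the snippet below runs OpenCV's stock frontal-face cascade over an image (this is not Hulu's multi-view detector and only handles near-frontal faces):

# Cascade-based sliding-window detection with a stock OpenCV model.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    # scaleFactor and minNeighbors trade off recall, false detections and speed.
    return cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5,
                                    minSize=(30, 30))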

However, parallel processing the different detectors for multiple poses ignores the structure of the face pose space and is inefficient. To facilitate the feature- and detector-sharing among different poses, various hierarchical detector structures have been proposed and implemented. We chose the pyramid structure for its simple and independent training process for the underlying component detectors. Pyramid structure is a coarse-to-fine partition of multi-view faces. See the following Figure 3 for an illustration of the yaw based partition process.


 Figure 3. Partition process of yaw angle

Our situation is a bit more complex since we need to deal with in-plane rotation and yaw rotation at the same time. Thus a branching node is needed to decide whether a given example will go to the in-plane rotation branch or the yaw rotation branch (Figure 4). More specifically, we train a five-stage all-pose face/non-face detector as the root node. Then we train two detectors for in-plane rotation and yaw rotation respectively, each with ten stages. The outputs of these two detectors are compared to select a subsequent branch hereafter. After that, the problem is converted to the solved problem of rotation in one dimension, be it in-plane rotation or yaw rotation. In a given branch, the same coarse-to-fine strategy is used. The final output incorporates both face position and pose estimation.


Figure 4. Whole face detector structure to handle multi-view faces

Haar wavelet features are typically used in face detection because they are simple and fast. However, the resulting feature sets often have tens of thousands of dimensions. In contrast, the local binary pattern (LBP) feature is only a 58-bin sparse histogram. It captures the local geometric structure of the image and is less sensitive to global illumination variations. Thus, we’ve adopted the LBP histogram for our system.

We’ve also integrated the boosting framework for training the classifier stages. We use a RankBoost-like reweighting scheme in each round to balance the weights of positive and negative examples, which is useful for tuning the classifiers to focus more on the limited positive examples. We also follow the nested cascade structure to further reduce the number of weak classifiers needed in the detector.

Synthetic examples, such as flipped and rotated versions of faces with small random translation, scale and rotation transformations, are created to enlarge the face dataset. In training, multi-threading makes the process faster.

Our multi-view face detector can detect faces in about 300ms for 640×360 images. The accuracy is about 80 percent for frontal faces and 60 percent for profile faces, both at 5 percent false detection rate.

This is the 2nd blog in the Face Match tech blog series. You can browse the other three blogs in this series by visiting:

1. Face Match System – Overview

3. Face Match System – Shot Boundary Detection and Face Tracking

4. Face Match System – Clustering, Recognition and Summary

Face Match System – Overview

May 3rd, 2014 by Zhibing Wang

Motivation

We must confess: sometimes even we have a hard time recognizing actors in TV shows and movies. Sometimes the name is right on the tip of our tongues, but we still can’t place it. It’s even more difficult with some foreign actors. But if there were a way for a video to provide detailed metadata about an actor whenever he or she pops up, Hulu users could have that information displayed right in the Hulu player, with the option to learn more about the actor they’re interested in whenever they wanted.

From another point of view, general multimedia content analysis remains an unsolved problem — even with significant progress made in the past 20 years. However, unlike general content analysis, face-related technologies like face detection, tracking and recognition have recently matured into consumer products. The combination of these advances in technology with our relentless pursuit to enhance the user experience at Hulu is where the idea of “Face Match” originated.

System design

When first examining the problem, one solution would be to examine all frames of the video and exhaustively annotate, by hand, all the faces that appear in them. However, this method would not scale to the billions of videos on the Internet. The other extreme would be to let an algorithm automatically detect and identify the faces in every frame. However, the bottleneck of this approach is that current recognition algorithms can only achieve approximately 80% accuracy at best — which is far below the minimal user expectation. Taking both of these methods into account, it became apparent that the best solution would be to combine the merits of each and find a way to reduce the human effort to the lowest possible level.

Our system was designed to carefully balance the computational complexity while also minimizing human effort. As shown in Figure 1, the Face Match platform contains two main parts: the initial segment and the auto-tag-cluster-confirm cycle. For each video, faces are detected and grouped into face tracks/groups. The details of these technologies are described in the next paragraphs.

To minimize the amount of human effort required to label each individual face, visually similar face tracks are grouped via clustering. Thus, a human can select a number of face tracks at a given time and label all of them in one fell swoop.

For each show, the system first collects celebrity information from the web. Then, for initial videos in each show, 20 percent of face tracks are clustered and left for manual labeling. These bootstrapped celebrity labels are helpful in supervised track tagging. Though all face tracks can be clustered and simply left for manual labeling, this leads to a heavy workload. To improve the efficiency of human annotation, we’ve introduced an auto-tag-cluster-confirm cycle. With the bootstrap labels, the system can learn predictive models for celebrities. The models predict unlabeled tracks that are left for human confirmation. As the pool of celebrity labels grows with each iterative cycle, the system is able to learn face models with better precision. In the front end, displaying a large number of a celebrity’s face tracks for manual confirmation would be inefficient since a human still needs seconds to verify each face track. Similar to the initial annotation process, the system also clusters visually similar face tracks together. Thus, humans can confirm a number of tracks in one simple click, with one quick glance.

Figure 1. Overview of the system design. A.) Face groups/tracks are detected and extracted for each video; B.) For each show, celebrity information is collected and the initial pool of face tracks (20 percent) is clustered for bootstrap labels by user annotation; C.) Automatic face track tagging is introduced in the auto-tag-cluster-confirm cycle to minimize the human effort.

To detect and connect faces and place them into tracks, we leverage face detection and tracking algorithms. We’ve also trained a multi-view face detector for 180-degree plane rotation and 180-degree yaw changes with about ten thousand labeled examples. The face detection algorithm needs roughly 300 milliseconds to process a 640×360 frame (running on PC with Xeon(R) CPU E5420 at 2.50GHz). Thus, detecting all video frames would consume nine times the amount of real-time processing — which is unacceptably slow. Our tracking system rescues us from such heavy computation and associates isolated faces into continuous tracks. It can also extend face tracks to frames well beyond the initial detection result, which increases the whole system recall at moments when the detector fails to detect existing faces. This also effectively reduces the number of candidates for later face tagging by a factor of 100. As a result, we only need to tag face tracks and not isolated faces. To avoid the “drifting away” phenomenon in tracking, shot boundaries are detected and incorporated as well.

In automatic track tagging, we also take advantage of the multi-sample (multiple faces per face track), multi-view features (clothes, hair, facial and contextual features). As in Figure 2, where the pipeline of automatic track tagging is shown, the system first builds databases with annotated tracks. Then for each face track, the system extracts multi-view features for all samples in the track. And, for each face and each feature, the system finds its nearest neighbors via ANN and decomposes it as a linear combination of its neighbors. Finally, the identity coefficient distributions for all faces and all features are aggregated for the final tagging of this track. For the details of the algorithm, please refer to the Track Tagging section.

Figure 2. Algorithm pipeline for the track tagging method

Processing pipeline

As shown in Figure 1a), when a video is about to be released, the system first determines the shot boundaries, densely samples frames, and applies the multi-view face detector to each sampled frame. The detected faces provide starting points for face tracking. Then, tracking algorithms associate the isolated single faces into connected face groups of the same person. With the extracted tracks, clustering algorithms group similar tracks for user annotation, or a track tagging stage can automatically tag actor candidates on each track for user confirmation. Finally, the face tracks are left for human annotation or confirmation.

Combining these steps, we can automatically tag the tracks of one video in real time. For a typical TV show, 80 percent of the face tracks can be successfully picked out for processing with 5 percent false positive samples. After that, we still require human intervention to verify the final results.

In the next three blogs, four core techniques — face detection, face tracking with shot boundary detection, face track clustering, and face track recognition — will be introduced. The annotation step is omitted since this blog covers only technical algorithms.


BeaconSpec: A domain-specific language for generating MapReduce jobs

April 10th, 2014 by Prasan Samtani

Perhaps of all the creations of man, language is the most astonishing

- Giles Latton Strachey, Words and Poetry

Introduction

Hulu viewers generate a tremendous amount of data: our users watch over 400 million videos and 2 billion advertisements a month. Processing and analyzing that data is critical to our business, whether it is for deciding what content to invest in in the future or for convincing external partners of the superiority of Hulu as an advertising platform. In order to ingest, process and understand that data, we rely on a framework called Hadoop (http://hadoop.apache.org/) – an open source project that allows for the storage and distributed processing of large data-sets.

The two major components of Hadoop are a distributed file system (HDFS) and MapReduce – a framework for distributed computing over the cluster. In order to understand the rest of this blogpost, it is important to have a basic understanding of how MapReduce works.

When authoring a MapReduce job, programmers specify two functions – a map function that processes each chunk of data to produce a set of key-value pairs, and a reduce function that merges all values associated with the same key (1). A surprising number of conventional algorithms can be expressed in this manner and thereby converted into programs that can run in parallel. This is best understood with an example.

Let us imagine that we are trying to compute the total number of minutes watched for a given set of videos. Each video has a unique identifier, and what we would like at the end of this computation is a table of the following form:

Video identifier    Minutes watched
1                   25
2                   34
3                   126
4                   5

In a distributed file system such as HDFS, the raw data will be stored on several separate computers, or nodes. A naive way to compute this data would be for the program to serially request the raw data from each of the nodes and perform the computation. Such an approach works fine for small data-sets such as this example; however, we want an approach that scales to very large data-sets. For example, for any typical hour, we have about 50 GB of raw data on our Hulu cluster. Clearly, sequentially processing all this data would take a very, very long time and consume a lot of network resources. Using MapReduce, the mapper process runs on each node in parallel, generating key-value pairs consisting of the video identifier (the key) and the minutes watched (the value). Illustration 1 shows what the output of the mapping phase would look like if the data were distributed across three nodes.

 

Illustration 1. Mapper output for the minutes-watched example, with the raw data distributed across three nodes

 

From here, we go into the reduce phase, where all outputs with the same key are guaranteed to be sent to the same reducer. The reducer then computes some function over the intermediate output to produce a final output. In this case the function is summation, as all we want is the total number of minutes per video. Illustration 2 shows what this would look like if this processing ran on two reducers.

Illustration 2. Reducer output for the minutes-watched example, with the intermediate keys partitioned across two reducers
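A minimal sketch of the two phases for this minutes-watched example, with plain Python standing in for Hadoop's Java API (the record format is assumed):

# Stand-in for the minutes-watched MapReduce job (illustration, not Hadoop API code).
from collections import defaultdict

def map_phase(records):
    # Each record is assumed to be (video_id, minutes_watched) for one playback event.
    for video_id, minutes in records:
        yield video_id, minutes           # emit key-value pairs

def reduce_phase(mapped):
    totals = defaultdict(int)
    for video_id, minutes in mapped:      # all values for one key reach the same reducer
        totals[video_id] += minutes       # here the reduce function is summation
    return dict(totals)

# reduce_phase(map_phase([(1, 10), (2, 34), (1, 15)])) -> {1: 25, 2: 34}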

 

Beacons

Event data from our various Hulu players is encoded in the form of beacons. A beacon is just a URL-encoded description of the event, as shown in the example below:

80    2013-04-01    00:00:00    /v3/playback/start?bitrate=650&cdn=Akamai&channel=Anime&client=Explorer&computerguid=EA8FA1000232B8F6986C3E0BE55E9333&contentid=5003673

The Hulu players on all our devices send a constant stream of beacons to our servers; these beacons are subsequently stored on HDFS, where they can be processed by our MapReduce jobs.
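For reference, a beacon like the sample above can be decoded with the Python standard library; the field names simply follow that sample:

# Decoding the sample beacon's path and query string.
from urllib.parse import urlparse, parse_qs

beacon = ("/v3/playback/start?bitrate=650&cdn=Akamai&channel=Anime"
          "&client=Explorer&computerguid=EA8FA1000232B8F6986C3E0BE55E9333"
          "&contentid=5003673")
parsed = urlparse(beacon)
event_type = parsed.path                               # '/v3/playback/start'
fields = {k: v[0] for k, v in parse_qs(parsed.query).items()}
# fields['contentid'] == '5003673', fields['cdn'] == 'Akamai', ...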

BeaconSpec

So far, we’ve seen how we can transform a conventional single-threaded computation into a MapReduce computation by specifying a mapper and reducer function. We’ve also seen how we encode events as beacons, and collect them onto our distributed file system, HDFS. However, in our experience, writing MapReduce jobs by hand is both tedious and error prone, although like most skills, you get better at it with practice. Additionally, the resultant code contains a significant amount of boilerplate, making the logic hard to see at first glance. The latter is in practice the most significant impediment to overall system understandability and debuggability. Since we run many (on the order of 150-175) different types of MapReduce jobs every hour, we wanted a solution that would allow us to encode the logic in a straightforward way that was easier to maintain than hand-written Java code.

We started by looking at our code and realizing that the majority of our MapReduce jobs perform very similar functions – selecting a set of dimensions that we care about (for example the video identifier, the zipcode, etc), performing some lookups against meta-data tables in our key-value store (for example deriving the zip code dimension from the IP address of the request) and aggregating over a corresponding fact (for example, the total minutes watched). We realized that we could embed the basic knowledge of how to do these things in a language, and then simply use programs written in this language to generate our MapReduce code for us. Since internally we refer to the events we receive from our players as beacons, and the purpose of this language was to process raw beacon data, we called the language BeaconSpec.

An example of BeaconSpec code is shown below:

basefact playback_start from playback/start {
    dimension harpyhour.id as hourid;
    required dimension video.id as video_id;
    required dimension contentPartner.id as content_partner_id;
    required dimension distributionPartner.id as distribution_partner_id;
    required dimension distributionPlatform.id as distro_platform_id;
    dimension distributionPlatform.isonhulu as is_on_hulu;
    dimension package.id as package_id;
    dimension package.isplusbypackage as is_plus_by_package;
    dimension plan.id as plan_id;
    dimension plan.isplusbyplan as is_plus_by_plan;
    dimension plusCategory.pluslowercategoryid as plus_lower_category_id;
    dimension plusCategory.plusuppercategoryid as plus_higher_category_id;
    dimension client.out as client;
    fact sum(count.count) as total_count;
    dimension packageAvailability.chosen as package_availability;
    dimension siteSessionId.chosen as site_session_id;
    dimension facebook.isfacebookconnected as is_facebook_connected;
}

There are a few key points to note – first, there is no specification of how to compute the final result, only a declaration of what we would like to compute. It is the role of the BeaconSpec compiler to take this declarative specification and convert it into imperative MapReduce code that can run on the cluster. Second, there are a lot of special keywords that have meaning for us – the keyword basefact denotes a metric that we want to measure, the keyword from tells us which source beacons we need to process in order to compute the metric, the keyword dimension denotes a particular dimension of the source beacon that we care about and that should form a part of the intermediate key out of the Mapper phase, and the keyword fact denotes a dimension of the source beacon that we want to aggregate over (in this particular example, we are performing a summation, as the fact keyword is immediately followed by the sum specifier – we could just as easily calculate an average [avg] or take the maximum value [max]).

From this specification, our in-house compiler produces runnable Java MapReduce code, a snippet of which is shown below:

(Screenshot: a snippet of the generated Java MapReduce code)

We use a variety of open-source technologies in order to build our compiler – in particular JFlex for lexical analysis & CUP for parser-generation. These are the Java cousins of the old C programs you probably used if you’ve ever taken a compilers class – lex and yacc.

A major advantage of a formal declarative specification of our process is that it allows us to extend functionality far beyond what we initially planned. For example, we are currently in the process of building out a program whose purpose is to validate beacons sent by implementations of our Hulu player on a range of devices. For this purpose, we can use BeaconSpec as an input to the validation program, which will subsequently examine incoming beacons and compare them to the specification, and send us reports about whether the incoming beacons match or deviate from the specification. As another example, as we move towards real-time processing of our incoming data, we are examining the possibility of creating a secondary code-generator for the BeaconSpec compiler that will output code to run on Apache Storm instead of MapReduce.

Related work

Apache Pig is a project that has similar goals to BeaconSpec. Pig programmers write their scripts in a language called Pig Latin, which is subsequently compiled into MapReduce code. However, unlike BeaconSpec, Pig is both imperative and general-purpose. We feel that for this particular use case, the advantages conferred by a declarative domain-specific language are too great to abandon for a general-purpose language. An imperative general-purpose language cannot avoid introducing boilerplate and insignificant details that cause the final program to be significantly less clear than what we can achieve with a declarative domain-specific language.

Summingbird is a project at Twitter which can generate code targeting MapReduce, Scalding or Storm. It offers a powerful set of abstractions for performing aggregations. An example of the canonical word-count program in Summingbird is shown below:

def wordCount(source: Iterable[String], store: MutableMap[String, Long]) =
   source.flatMap { sentence =>
     toWords(sentence).map(_ -> 1L)
   }.foreach { case (k, v) => store.update(k, store.get(k) + v) }

For an example of the equivalent code written directly in Java MapReduce, see this link. Summingbird was written to solve problems similar to the ones that led us to create BeaconSpec, and we believe it does so in a way that is significantly more expressive than Pig. However, it is written in a highly idiomatic style, and writing Summingbird programs has a steeper learning curve than BeaconSpec.

A few other languages have emerged around the Hadoop ecosystem, such as Kiji (which offers a table abstraction and several insertion/retrieval operators over HBase), and Hive (HiveQL) – which offers a subset of relational database operators that are compiled to MapReduce. We have not fully explored Kiji, however, we make heavy use of Hive at Hulu.

Sawzall (developed at Google) can arguably be acknowledged as the progenitor of all these languages; it is another general-purpose data processing language, designed to compile to MapReduce code on Google’s proprietary distributed data processing platform. A link to a paper on it can be found here.

Conclusion

The key takeaway is that we don’t want a general purpose language, we want a language that expresses exactly what we care about, and suppresses details that are not central to the task (2). Whenever you are working on a DSL, adding general purpose features to the language is a serious temptation, but one that must be avoided if you don’t want your project timeline to rival that of Duke Nukem Forever. This sentiment is best captured by the pioneer computer scientist Alan Perlis in the following quote:

Beware of the Turing tar-pit, in which everything is possible, but nothing of interest is ever easy.

-Alan Perlis, Epigrams in Programming

 

References

  1. MapReduce: Simplified data processing on large clusters (Jeffrey Dean and Sanjay Ghemawat, Google Inc., 2004) http://static.googleusercontent.com/media/research.google.com/en/us/archive/mapreduce-osdi04.pdf
  2. Structure and interpretation of computer programs (Hal Abelson and Gerald Jay Sussman, MIT, 1984) http://mitpress.mit.edu/sicp/
  3. Epigrams in Programming (Alan Perlis) http://www.cs.yale.edu/homes/perlis-alan/quotes.html
  4. Interpreting the Data: Parallel Analysis with Sawzall (Pike et al, Google Inc.) http://research.google.com/archive/sawzall.html

Categorizing Customer Support Contacts with Machine Learning

December 17th, 2013 by Chris Liu

Introduction

One of the cool things about being an intern at Hulu is that we are given the freedom to creatively apply our computer science knowledge to everyday problems. Our customer support process generated such a problem: we’ve got a stream of incoming customer support emails, and we would like to automatically assign textual tags to them based on their content. This is a machine learning problem of text classification, and one that we tackle below.

A binary text classification problem is to learn a rule that classifies a concept given a training set. The training set consists of pairs (x_i, y_i), with x_i a text from some instance space X and y_i \in \{0,1\} a binary label. We assume there exists a true decision function f: X \rightarrow \{0,1\} such that f(x_i) = y_i. Our goal is to find a hypothesis h: X \rightarrow \{0,1\} that best approximates f.

For example, if we take our instance space to be the set of all sentences in the English language, and y_i = 1 to mean the sentence has positive sentiment, then our problem is to find a classifier that classifies the sentiment of English sentences. This example already illustrates problems we will face later — how do we reconcile the fact that the set of all sentences is potentially infinite? How do we represent our sentences in a way that best suits learning? And, most importantly, how do we best learn a hypothesis h from just our training examples?

Hulu gets thousands of customer support emails daily. Each email is represented by a ticket and associated with a set of concise descriptors called tags. With a well-performing tag classifier, the benefits are tangible — we can save our Customer Support Advocates (CSAs) the need to manually tag tickets, and we can route tickets to specific groups of CSAs who specialize in a particular subject area. In general, we would be able to streamline the support experience.

Our problem comes with a unique set of challenges. We would like our service to run on Donki, Hulu’s internal service hosting platform, so we’d like to keep our memory usage bounded to a somewhat conservative 512 MB. The size of our training set may be very large, and potentially unbounded. With these challenges in mind, we will describe the approach taken, what worked, what didn’t work, and the end result.

Tools

Given our wish for the service to run on Donki, Python seems like the natural choice. The NumPy/SciPy/Scikit-Learn stack is an accessible, open source, and comprehensive set of libraries that suffices for our purposes.

As with any machine learning problem, feature extraction, feature selection and model selection are crucial. Feature extraction decides which tickets are suitable for training and extracts the useful parts of suitable tickets; this is the process of canonicalizing a ticket. Feature selection turns the canonicalized tickets into a training set. Model selection turns the training set into a classifier.

Feature Extraction

We would like our canonicalized ticket to be faithful to the eventual prediction context. This means, for example, that the CSA’s response to the ticket should not be included. Fortunately, all changes to a ticket are recorded in chronological order, so we are able to extract only the text and tags relating to the customer’s first contact.

Feature extraction also relies upon heuristics and domain-specific knowledge of the problem. Some additional ticket fields are also of value: we take the operating system field, the subject field, and whether the ticket was created by a Hulu Plus subscriber. For example, tickets from Hulu Plus subscribers are more likely to be about billing. To incorporate these additional fields, we append them to the existing text; for example, if a user is a Hulu Plus subscriber, we append the word “IsPlusSub”.

Sometimes, manually viewing the tickets and the extracted result is the only way of gaining insight. One particular issue was that users replying to emails would bring in a large number of highly suggestive words. For example, an automated email from Hulu contained a link allowing users to “unsubscribe” from these emails, which was highly suggestive for certain tags. A heuristic was used to filter out the quoted reply portions of the text. Another amusing example was that certain Hulu emails contained the words “Vizio”, “Samsung”, “Sony”, and other TV platforms on which Hulu Plus is available. A pre-existing rule would activate on these keywords, giving these tickets a large number of irrelevant tags.

Feature Selection

As we hinted before, the instance space X from which we draw tickets may be infinite. One common approach to this problem is the “bag of words” representation: we simplify our instance space from the set of all ticket texts to the set of all “suitable” words, which is always finite. The “suitable” criterion is another heuristic, but intuitively, we want to exclude common but meaningless words. For example, the word “I” or “we”, or, in our case, “Hulu”, is meaningless with regard to the tags of a message. Excluding these words helps our model find meaningful relationships.

Our revised instance space is often called the vocabulary of our examples. Another optimization we can make is to expand our vocabulary using “n-grams”: an n-gram is a contiguous sequence of n words, and n-grams can potentially capture relationships between words. For example, a ticket with the words “not cancel” and another with “cancel not” would be equivalent in a vocabulary consisting solely of single words. However, a 2-gram will capture this difference, since the first ticket’s vocabulary will contain (not, cancel, not cancel), while the second’s will contain (cancel, not, cancel not).

In order to use the bag-of-words representation, we need words. This involves pre-processing and tokenizing the canonicalized tickets. We settled on the n-gram range of (1,2) through experimental testing. This process results in very high-dimensional data, since each word and each 2-sequence of words is a dimension. We attempted dimensionality reduction techniques to combat the so-called “curse of dimensionality”. Techniques such as Principal Component Analysis and Latent Semantic Analysis attempt to find linear combinations of the data that best correlate. Exploration on this front produced some graphics.

(Plot: “account_billing” vs. “account_login_activation” tickets projected to two dimensions; no clear separation)

(Plot: “account_billing” vs. “technology_playback” tickets projected to two dimensions; clear separation)

The semantic meaning of an “account_billing” tag and an “account_login_activation” tag is quite similar: both have to do with a user’s account problems. Hence, when we project the examples down to their two most “information preserving” dimensions, we do not see clear separation. However, the “account_billing” and “technology_playback” tags have quite distinct meanings, and we see clear separation after dimensionality reduction. Just to confirm our intuition, we plot “android” vs “ios”, two tags with distinct meanings, and get the following:

(Plot: “android” vs. “ios” tickets projected to two dimensions; clear separation)

It’s always nice when the mathematics are in line with our intuition. While these techniques were not ultimately used due to scalability problems, the ability to visualize and confirm our suspicions regarding the separability of tags was highly valuable.
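The kind of projection used for those plots can be sketched as follows, assuming tf-idf vectors and truncated SVD (LSA); the tag names are just the ones discussed above:

# Project ticket vectors to 2-D with truncated SVD and scatter-plot two tags.
import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

def plot_two_tags(texts, labels, tag_a="account_billing", tag_b="technology_playback"):
    X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(texts)
    Z = TruncatedSVD(n_components=2).fit_transform(X)    # two most informative directions
    for tag, marker in [(tag_a, "o"), (tag_b, "x")]:
        idx = [i for i, label in enumerate(labels) if label == tag]
        plt.scatter(Z[idx, 0], Z[idx, 1], marker=marker, label=tag)
    plt.legend()
    plt.show()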

For each document, we want to create a corresponding bag-of-words representation. The simplest approach is to scan over the set of all documents (also called the corpus) and build a dictionary consisting of the entire vocabulary. If the length of this dictionary is n, then a document is modeled by a vector v of length n, where the i-th index represents an element of the vocabulary and the value at the i-th index is its frequency in the document. One common technique on top of the frequency count is “inverse document frequency” re-weighting, which intuitively says that words frequent across the corpus are inherently less meaningful.

The technique described above is often referred to as term frequency-inverse document frequency (tf-idf) vectorizing. It requires the entire vocabulary in memory, rendering it stateful, so we explored alternatives for scaling purposes. We settled upon the “hashing trick”: very simply, instead of building a vocabulary by scanning the corpus once ahead of time, we explicitly pre-define the output space to be a high-dimensional vector space; then, for each word and n-gram, we hash it and increment that index of the output vector. Because collisions are unlikely in a high-dimensional space, the hashed vector is a faithful representation of the original text. The main thing we gain is statelessness, so we can apply this technique to batches of documents. A brief survey of papers on this subject (here, and here) shows the hashing trick has both theoretical and empirical support.
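In scikit-learn this corresponds to the HashingVectorizer, which keeps no vocabulary state and can therefore be applied batch by batch; the output dimensionality below is an assumption, while the n-gram range mirrors the (1,2) choice discussed earlier:

# The hashing trick via scikit-learn's stateless HashingVectorizer.
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(
    n_features=2 ** 20,       # explicitly pre-defined high-dimensional output space
    ngram_range=(1, 2),       # unigrams and bigrams
    alternate_sign=False,     # keep plain counts per hashed index
)
X_batch = vectorizer.transform([
    "cannot cancel my subscription",
    "playback error on my living room device",
])  # no fit step needed: the vectorizer is stateless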

Model selection

In order to select a model, we need a way to calculate performance and a dataset to calculate performance on. Being a binary classification problem, performance is defined in terms of counts of true positives, true negatives, false positives, and false negatives. How we view these numbers is interesting and worth discussing. In our case, two relevant metrics are “precision” and “recall”. Precision measures the ratio \frac{\text{true positive}}{\text{true positive} + \text{false positive}}, whereas recall measures the ratio \frac{\text{true positive}}{\text{true positive} + \text{false negative}}. Achieving high precision means that when we do predict the positive label, we are mostly correct. High recall means we correctly predict positive on most of the positively labeled examples. High precision usually comes at the cost of low recall, and vice versa. For example, if a model does not predict positive on any example except the very few it is most sure of, it achieves high precision but low recall. If a model predicts positive on every example, it achieves high recall but low precision.

Because our tool will be used as a pre-processing step in the handling of a ticket, we decided that high precision is more valuable than high recall. We would rather not have the classifier make a positive prediction, unless it is reasonably sure that this prediction is correct.

To get a dataset for testing, we employ cross validation and use a held-out test set. In cross validation, we split the training set into n folds, and for 1 \leq i \leq n, we train on all folds other than the i-th fold and score on the i-th fold. We make a test set by holding back a portion of the training set from training and calculating performance on that set after the model is trained. Both methods ensure the set we score on does not participate in the training of the model.

The overall metric we would like to optimize is the F_1 score, which is the harmonic mean of precision and recall. It can be calculated as 2 \cdot \frac{\text{precision}\ \cdot\ \text{recall}}{\text{precision} + \text{recall}}. We surveyed a broad variety of classification algorithms, including Naive Bayes, K-Nearest Neighbors, Support Vector Machines, Stochastic Gradient Descent, Decision Trees, Random Forests, and AdaBoost. Unfortunately, the Decision Tree, Random Forest, and AdaBoost classifiers cannot handle sparse data, so they were immediately disqualified.

The scores, training times, and prediction times of the different classifiers are listed below for a collection of 15,000 training and 6,500 test documents.

Classifier F1 Precision Recall Training time Prediction time
Linear SVM 0.590 0.703 0.509 68.5s 0.43s
Stochastic Gradient Descent 0.587 0.689 0.512 33.2s 0.37s
Logistic Regression 0.556 0.718 0.454 31.8s 0.43s
Bernoulli Naïve Bayes 0.438 0.353 0.577 2.20s 1.76s
Multinomial Naïve Bayes 0.450 0.361 0.597 33.0s 0.40s
K-nearest neighbors 0.477 0.429 0.538 n/a 446.23s

In the end, the Stochastic Gradient Descent classifier was chosen. It has a desirable combination of speed and performance, and can be tweaked to optimize for a variety of objective functions. Furthermore, it is an online algorithm, meaning the training process can be split into batches, alleviating memory concerns.
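
Because it is an online learner, the classifier can be fed the corpus in batches. A minimal sketch of that loop, assuming scikit-learn’s SGDClassifier and stand-in batches (the post does not name the exact implementation):

# Training an online SGD classifier batch by batch to keep memory bounded (sketch).
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()                         # defaults are fine for the sketch

for _ in range(10):                           # pretend each iteration is one batch of tickets
    X_batch = np.random.rand(100, 20)         # stand-in hashed feature vectors
    y_batch = np.random.randint(0, 2, 100)    # stand-in labels for one tag
    # partial_fit updates the model without needing the full corpus in memory.
    clf.partial_fit(X_batch, y_batch, classes=[0, 1])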

Odds and Ends

There are many tags that are simply unpredictable. Text classification relies on the label having a clear and distinguishable meaning. For example, we found that tags indicating an action, such as giving a customer a credit, performed very poorly, since it is hard to gauge from the contents of the text whether a credit will be given. As such, we’ve implemented an automated pruning system in our classifier: we score each tag’s classifier on the test set at the end of training, and if that tag’s performance does not satisfy some minimum, we never predict true for that tag.
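
A hedged sketch of that pruning rule; the threshold, tag names, and data structures here are illustrative, not our production values.

# Disable per-tag classifiers whose held-out score falls below a minimum (sketch).
MIN_F1 = 0.4                                   # illustrative threshold

def prune(tag_classifiers, tag_f1_scores, min_f1=MIN_F1):
    """Keep only the classifiers whose test-set F1 meets the minimum."""
    return {tag: clf for tag, clf in tag_classifiers.items()
            if tag_f1_scores.get(tag, 0.0) >= min_f1}

# A tag like "credit_given" that cannot be predicted from ticket text
# scores poorly and is simply never predicted.
active = prune({"device_roku": "clf_a", "credit_given": "clf_b"},
               {"device_roku": 0.61, "credit_given": 0.12})
print(active)                                  # {'device_roku': 'clf_a'}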

Secondly, we may want to partition tags into classes, where at most one tag can be predicted per class. For example, we may wish to avoid the many-device-tags problem described above and enforce a maximum of one device tag per ticket. To solve this problem, we appeal to geometry. Any linear classifier is a hyperplane in the feature space and can be represented as a weight vector w. A prediction for an incoming vector x is made via sign(w \cdot x), since any positive number means the example lies on the positive-label side of the hyperplane. If we have a tag class C and a set of classifiers (w_i\;:\; i \in C), then we can output the prediction \arg \max_i (w_i \cdot x), provided \max_i (w_i \cdot x) > 0. This is commonly referred to as the one-vs-all method of extending binary classifiers to predict over a tag class C.
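
A small sketch of that argmax rule over a tag class, using plain weight vectors with made-up numbers:

# One-vs-all within a tag class: pick the highest-scoring hyperplane,
# but only if its score is positive (toy numbers).
import numpy as np

x = np.array([0.5, 1.0, 0.0])                        # incoming feature vector
class_weights = {                                    # hypothetical per-tag hyperplanes
    "device_roku":   np.array([ 0.2, 0.4, -0.1]),
    "device_iphone": np.array([-0.3, 0.1,  0.5]),
}

scores = {tag: w.dot(x) for tag, w in class_weights.items()}
best_tag, best_score = max(scores.items(), key=lambda kv: kv[1])
prediction = best_tag if best_score > 0 else None    # at most one tag per class
print(prediction, scores)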

Conclusion

Applying machine learning techniques to real-world data is tricky, exciting, and rewarding. Here, we’ve outlined some of the decisions we made, approaches that didn’t work, ones that did, and how the steps of feature extraction, feature selection, and model selection were performed. In the future, we may expand into classifying phone transcriptions and personalizing the classifier for individual CSAs, to further optimize the automation within our customer support platform.

Chris Liu is a software developer intern on our customer platform team.

Hulu iPad redesign: lessons and code

December 11th, 2013 by Bao Lei

Earlier this year we launched a brand new revision of our iPad application. With lots of new features like our show and video discovery panel, mini player, and more, we had to think very hard about the speed and stability of both the software we wrote and our development process. With a few months behind us now, we wanted to share some of the interesting technical tidbits from our product.

Building the UI

At Hulu, we choose not to use Interface Builder for our apps. Traditionally, we coded the position and resizing logic for each view by hand. While this is frequently a lot of labor, we felt it left us with better control over the results than dragging buttons around with a mouse. This approach keeps your views under control when a view is simple or you are adding a few small features. With a big redesign that includes many new pages and components, however, aligning everything manually in code becomes a lot less fun, and much more challenging, since the design is often tweaked after observing actual on-device user interaction.

There has to be a better approach, and the most interesting one for us was autolayout, which Apple introduced along with iOS 6. While awesome, it posed two problems: (1) it works only on iOS 6 and above, while we still support iOS 5; (2) it is designed primarily for Interface Builder users, so it is a little harder to use programmatically.

Then we looked at one small but interesting part of autolayout: the ASCII-art-like Visual Format Language (VFL). That eventually became our solution: we would use VFL, but not autolayout. Our solution is now released as an open-source project called vfl2objc.

How does it work? We built a system that generates Objective-C code from a VFL-defined layout. This lets us retain iOS 5 compatibility while keeping the layout instructions easy to read and easy to modify. The approach is performant because most of the computation is done at compile time (except the magic behind the autoresizing mask, which is relatively simple). It is also very flexible: you can call any sizing or positioning APIs before or after the VFL-generated code block; unlike autolayout, which ignores any frame you set manually, our VFL code generation tool won’t.

For example, you can go to the loadView method of a view controller class and add the following:

// VFL begin
/*
|-30-[label]-30-|
|-50-[button(100)]
V:|-10-[label(20)]-15-[button(50)]
*/
// VFL end
Then, run our vfl2objc script. This will generate the following:

// --- VFL GENERATED CODE ---
/*
|-30-[label]-30-|
|-50-[button(100)]
V:|-10-[label(20)]-15-[button(50)]
*/
{
    // You need to predefine superview before this.
    CGRect frame;
    frame = label.frame;
    frame.origin.x = 0 + 30;
    frame.size.width = superview.bounds.size.width - (0 + 30) - frame.origin.x;
    frame.origin.y = 0 + 10;
    frame.size.height = 20;
    label.frame = frame;
    label.autoresizingMask |= UIViewAutoresizingFlexibleWidth;
    label.autoresizingMask |= UIViewAutoresizingFlexibleBottomMargin;
    [superview addSubview:label];
    frame = button.frame;
    frame.origin.x = 0 + 50;
    frame.size.width = 100;
    frame.origin.y = 0 + 10 + 20 + 15;
    frame.size.height = 50;
    button.frame = frame;
    button.autoresizingMask |= UIViewAutoresizingFlexibleRightMargin;
    button.autoresizingMask |= UIViewAutoresizingFlexibleBottomMargin;
    [superview addSubview:button];
}
// --- END OF CODE GENERATED FROM VFL ---
When you trigger the script, the native UI alignment code emerges between the VFL markers. The generated lines are wrapped in curly braces so that you can fold them and look at just the VFL part. If your design changes, you simply edit the VFL definition and run the script again. Of course, you can also trigger the generation automatically with an Xcode pre-build script.

To learn more about vfl2objc, please visit http://github.com/hulu/vfl2objc

Defining Events

While NSNotification is very handy, it can be bug-prone because of its weak typing. For example, when you post a notification, the compiler cannot check that the notification name actually matches the name expected by the observers. Also, any information passed along with the notification is either a weakly typed object or a dictionary. To reduce the incidence of bugs, you need to document each notification clearly and keep that documentation up to date as your application changes. Decoupling documentation from implementation often leads to people updating one but not the other, which becomes a source of problems.

To deal with this issue, we introduced a replacement for NSNotification — the HUTypedEvents. The basic idea is to use a protocol, plus a method signature, to represent an event. This allows the compiler to check both the event name and parameters.

For example, suppose we have an event PlaybackStart with integer parameters videoID and showID. If we were to implement it with NSNotification, the code to define it would be:

NSString * const PlaybackStartNotification = @"PlaybackStart";
NSString * const PlaybackStartVideoIDKey = @"PlaybackStartVideoID";
NSString * const PlaybackStartShowIDKey = @"PlaybackStartShowID";
In order to post the event, somewhere inside the playback component we need to call:
[[NSNotificationCenter defaultCenter]
   postNotificationName:PlaybackStartNotification
   object:nil
   userInfo:@{PlaybackStartVideoIDKey: [NSNumber numberWithInt:videoID],
              PlaybackStartShowIDKey: [NSNumber numberWithInt:showID]}];
Anyone who needs to handle this event will need to do:
// during init
[[NSNotificationCenter defaultCenter]
   addObserver:self
   selector:@selector(handlePlaybackStart:)
   name:PlaybackStartNotification
   object:nil];
...
// implement handler
- (void)handlePlaybackStart:(NSNotification *)notification {
    int videoID = [[notification.userInfo
        objectForKey:PlaybackStartVideoIDKey] intValue];
    // some logic based on video ID
    ...
}
With HUTypedEvents, this is both simpler and safer. Create a singleton class, for example CentralEventHandler. Declare the event at the beginning of CentralEventHandler.h outside the class interface:
HUDeclareEvent(PlaybackStart, videoID:(int)videoID showID:(int)showID)
In the class interface, add:
HUDeclareEventRegistration(PlaybackStart, videoID:(int)videoID showID:(int)showID)
In the implementation for the class, simply add:
HUImplementEvent(PlaybackStart, videoID:(int)videoID showID:(int)showID)
Then, to trigger this event, call:
[[CentralEventHandler instance]
   handleEventPlaybackStart__videoID:videoID 
   showID:showID];
And to handle this event, let the class conform to protocol EventPlaybackStart, and then do:
// during init
[[CentralEventHandler instance] registerEventPlaybackStart__observer:self];
...
// implement handler
- (void)handleEventPlaybackStart__videoID:(int)videoID showID:(int)showID {
    // logic based on videoID and showID
}
This approach provides the following advantages:

  • When you trigger the event, you don’t have to look up or remember the event name, parameter names, or parameter types.
  • You don’t have to convert primitive types into objects.
  • Whenever you are triggering or handling your event, Xcode will provide autocomplete on the method and parameter names.
  • Whenever you register to receive an event, the compiler will check whether the class conforms to the right protocol, and will ensure that you implemented the right handler method with the right parameters and types.
  • If you want to find every use of this event, both where it’s fired and where it’s handled, you can perform a global code search on “handleEventPlaybackStart”. The search results are much cleaner than when you attempt the same with NSNotification and search for PlaybackStartNotification.

To learn more about HUTypedEvents, please check out http://github.com/hulu/HUTypedEvents

Text Rendering

We’ve upgraded GSFancyText, our open-source rich text rendering framework. We moved all of the markup parsing, line breaking, and alignment to a background thread, and only do the final drawing on the main thread. Since our new application has many scroll views containing lots of images and rich text labels, this new approach is much more performant.

The library’s usage is similar to the older version, except that you need to explicitly call updateDisplay after a GSFancyTextView object is constructed. In case you need to align some other UI based on the size of a GSFancyTextView obtained from the asynchronous calculation, you can use a new method called updateDisplayWithCompletionHandler.

The vfl2objc system, HUTypedEvents, and the updated GSFancyText are just a few examples of the small, fun technical improvements we folded into our big iPad redesign project. While building the Hulu iOS application, we are constantly looking for better ways to make a nice-looking, maintainable, and performant app. We’re happy to share some of our learnings and code with the wider development community.

Bao Lei is a senior software developer on the mobile team who works on our iOS platform.


RestfulGit: an open source web service for accessing git data

September 9th, 2013 by Rajiv Makhijani

We love Git at Hulu — it’s central to almost all software development here. RestfulGit was built to make it easier to build tools and processes that leverage data from our internal Git repositories. It provides a read-only restful web API for accessing low-level Git data. For compatibility, and to make it easier to build tools that can access both open-source hosted Git repos and internally hosted repos, the API was modeled off of the GitHub Git DB API.

The API provides the following endpoints:

Commits

Retrieves a list of commit objects:

GET /repos/:repo_key/git/commits

optional: ?start_sha=:sha
optional: ?ref_name=:ref_name
optional: ?limit=:limit (default=50)

Retrieves specific commit object:

GET /repos/:repo_key/git/commits/:sha

Blobs

Retrieves a specific blob object:

GET /repos/:repo_key/git/blobs/:sha

Trees

Retrieves a specific tree object:

GET /repos/:repo_key/git/trees/:sha

Refs

Retrieves a list of refs:

GET /repos/:repo_key/git/refs

Retrieves a specific ref:

GET /repos/:repo_key/git/refs/:ref_name

Raw Files

Returns the raw file data for the file on the specified branch:

GET /repos/:repo_key/raw/:branch_name/:file_path
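
As a usage sketch, here is how one might hit a couple of these endpoints from Python with the requests library; the host and repository name are placeholders.

# Querying a RestfulGit deployment (sketch; host and repo name are placeholders).
import requests

BASE = "http://restfulgit.example.com"        # hypothetical internal host

# List recent commits, using the optional limit parameter shown above.
commits = requests.get(BASE + "/repos/myrepo/git/commits",
                       params={"limit": 5}).json()

# Fetch the raw contents of a file on a branch.
readme = requests.get(BASE + "/repos/myrepo/raw/master/README.md").text
print(len(commits), len(readme))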

For more information on the currently supported features, see the README. Of course, this project would not be possible without the many open-source projects it is built upon: Flask, pygit2, libgit2, and Python.

Find the source code on GitHub. Contributions welcome!


Tips and Highlights from Developing Mobile apps for Windows

July 15th, 2013 by Zachary Pinter

For the past year or so, our mobile team has been hard at work on creating native apps for the Microsoft ecosystem.

Now that we’ve launched both Hulu Plus for Windows Phone 8 and Hulu Plus for Windows 8, I’d like to take some time to reflect on their development by showing off some of the platform highlights.

Async/Await

The first and most obvious highlight is the async and await keywords added to C# 5. This pattern has been enormously helpful in cutting down code complexity and keeping our UI fast and responsive.

Typically, when writing code that makes use of asynchronous libraries (such as node.js), you end up structuring your project’s code around the handling of callbacks. With async/await, you get to write code that takes full advantage of asynchronous, non-blocking, off-the-ui-thread functions, while organizing your code much the same way as you might have done using synchronous calls.

There’s a lot of depth and detail about how this is done under the hood (see here), but I think it’s best to start with a few quick examples:

public async Task<string> Nonce()
{
   var res = await ExecuteSiteRequest(new NonceRequest());
   return res.Value;
}

public async Task<SiteUser> Login(string username, string password)
{
    var req = new AuthenticateRequest(username, password, await Nonce());
    var res = await ExecuteSiteRequest(req);
    var user = new SiteUser(res);

    return user;
}

In the above example, the actual delay involved in waiting for the HTTP calls triggered by ExecuteSiteRequest happens outside the UI thread, while the rest of the logic runs on the UI thread. The intelligence of the async/await keywords comes from the compiler knowing how to rewrite the methods and choosing when to run the various fragments of code that surround the calls to ExecuteSiteRequest.

Notice how the await keyword is placed inside the constructor of the AuthenticateRequest object. This means that before the first AuthenticateRequest object ever gets constructed, the framework will first wait for the successful execution of the async Nonce() method. After the Nonce() method’s call to ExecuteSiteRequest returns successfully, the code in the Login method resumes where it left off (now on the UI thread) and is able to construct the AuthenticateRequest.

If any of the calls to ExecuteSiteRequest were to fail (e.g. a server-side error), an exception would bubble up on the UI thread just as if this were a blocking method (even if the original source of the exception came from a worker thread executing the HTTP request). We added a few keywords, but the organization of the code remains simple and straightforward.

Keep in mind that the value provided by these keywords is not limited to IO calls. The metaphor extends to anywhere you might want to break away from the current thread. For example, a common issue in UI development is stuttering or freezing and how to prevent it. The obvious first step is to move all your service calls off the UI thread, though you can still end up with a stuttering app if you take up too much CPU time on the UI thread; for example, you might execute your HTTP call off the UI thread but then process the result on the UI thread. In C# 5.0, you can pass a lambda to Task.Run (which executes that lambda on a worker thread pool) and await the result.

Here’s an example of how we use Task.Run to process JSON:

// Example:
//  var user = await JsonToObject<User>(userJson);
public async Task<T> JsonToObject<T>(string json)
{
    var res = await Task.Run<T>(() => {
        return JsonConvert.DeserializeObject<T>(json,
            new JsonSerializerSettings
            {
                MissingMemberHandling = 
                 Newtonsoft.Json.MissingMemberHandling.Ignore
            });
    });
    return res;
}

The same sort of pattern can be applied to any snippet of code or function call that you might expect to be CPU intensive. Assuming the snippet doesn’t try to manipulate shared state, just add in the async/await keywords and wrap the snippet in a call to Task.Run.

Static Extension Methods

Extension methods have been in C# for quite some time (since 3.0), but I’ve found several handy use cases for them during the development of our Windows 8 and Windows Phone 8 applications.

One of them I’d like to highlight is type-conversion helper methods. Like many mobile apps, we work with and parse data from a variety of different backend services and libraries. A lot of the time, this means converting to and from typed and primitive values (e.g. string, double, bool, int, object, Double). So, we created extension methods to make the process easier:

public static int AsInt(this object me, int defaultValue = default(int))
{
    if (me == null) return defaultValue;
    if (me is string)
    {
        int result;
        if (int.TryParse(me as string, out result))
        {
            return result;
        }
    }

    if (IsNum(me))
    {
        return Convert.ToInt32(me);
    }

    return defaultValue;
}

public static bool AsBool(this object me, bool defaultValue = default(bool))
{
    if (me == null) return defaultValue;

    if (me is string)
    {
        var meStr = (me as string);

        if (("true".Equals(meStr, StringComparison.OrdinalIgnoreCase)) ||
            ("1".Equals(meStr, StringComparison.OrdinalIgnoreCase))
        )
        {
            return true;
        }
        if (
            ("false".Equals(meStr, StringComparison.OrdinalIgnoreCase)) ||
            ("0".Equals(meStr, StringComparison.OrdinalIgnoreCase))
            )
        {
            return false;
        }
    }

    if (me is bool)
    {
        return (bool)me;
    }
    return defaultValue;
}

Granted, in many cases you can just type cast and be done with it (e.g. var x = (int)foo). However, doing so naively can run into issues later on if the server-side data changes format (e.g. values that are largely whole numbers, but can have decimal values under some scenarios).

Some areas where these extension methods are helpful:

  • Saved settings from IsolatedStorageSettings

  • Parsing page parameters in NavigationContext.QueryString

  • Json objects represented as IDictionary

Another benefit of using static extensions is that they can be chained and still behave correctly when null values are returned earlier in the chain.

For example:

protected virtual void LoadState(System.Windows.Navigation.NavigationEventArgs e, 
                                 IDictionary<String, Object> pageState)
{
    // omitted code...

    bool resetHistory = 
       this.NavigationContext.QueryString.ValueOrDefault("reset_history").AsBool();

    // omitted code...

    if (resetHistory)
    {
        while (NavigationService.RemoveBackEntry() != null)
        {
            ; // do nothing
        }
    }
}

In the above example, ValueOrDefault can return null (the default value for a string) if the “reset_history” string isn’t present in the QueryString dictionary. However, that’s no problem for our AsBool() helper, since it returns the default boolean value (false) if called on a null object.

The ability to consolidate all this type conversion logic into easy-to-call, composable extension methods makes our code both cleaner and safer.

For the full set of type conversion helpers we use, see ObjectHelper.cs.

XAML Data Binding and MVVM

A fairly common idea in UI development is that your views should be decoupled from your models in a way that allows for, at least in theory, alternative views to share the same model.

One of the best ways to achieve this on Microsoft platforms is through XAML and the MVVM pattern. XAML (eXtensible Application Markup Language) is an XML format for specifying UI layouts and templates. MVVM (Model-View-ViewModel) is a design pattern in the same genre as MVC with a focus on data binding.

Let’s see how this works in practice:

<Grid x:Name="detailsContainer" Grid.Row="1" Background="White" 
  VerticalAlignment="Top" Grid.ColumnSpan="2">
    <Grid Margin="15">
        <Grid.RowDefinitions>
            <RowDefinition Height="Auto"/>
            <RowDefinition Height="Auto"/>
            <RowDefinition Height="*"/>
        </Grid.RowDefinitions>

        <b:BehaviorCollection.Behaviors>
            <b:EventBehavior Event="Tap" Action="{Binding TileTextClickedAction}"/>
        </b:BehaviorCollection.Behaviors>

        <TextBlock
            Margin="0,15,0,8"
            Style="{StaticResource MastheadShowTitleTextStyle}"
            Text="{Binding Title}" 
            TextWrapping="NoWrap"
            TextTrimming="WordEllipsis"
            Visibility="{Binding Title, 
                               Converter={StaticResource HuluConverter}, 
                               ConverterParameter='NotBlankToVisible'}"
            />

        <TextBlock 
            Grid.Row="1"
            Margin="0,0,0,4"
            Style="{StaticResource MastheadVideoTitleTextStyle}"
            Text="{Binding Subtitle}"
            TextWrapping="NoWrap"
            TextTrimming="WordEllipsis"
            Visibility="{Binding Subtitle, 
                               Converter={StaticResource HuluConverter}, 
                               ConverterParameter='NotBlankToVisible'}"
            />

        <TextBlock 
            Grid.Row="2"
            Style="{StaticResource MastheadDescriptionTextStyle}"
            TextWrapping="Wrap"
            TextTrimming="WordEllipsis"
            Text="{Binding Description}" MaxHeight="80"
            Visibility="{Binding Description, 
                               Converter={StaticResource HuluConverter}, 
                               ConverterParameter='NotBlankToVisible'}"
            />
    </Grid>
</Grid>

In the above example we have three text blocks where any given text block can be hidden/collapsed if the text it’s bound to is blank (null or an empty string). Any time the title, subtitle, or description changes, the UI will automatically be updated (and each text block will be made visible or collapsed as needed). Behind the scenes, there’s a view model object (a fairly simple/typical C# object) driving this view. However, the view model has no specific knowledge about what this view looks like or how it is arranged (nor does it need to). Instead, the view model’s chief concern is in providing a useful set of bindable properties (getters/setters) that any particular layout might want.

XAML and MVVM end up being a really nice workflow and pattern when it comes to iterating on designs and reusing code across different templates (e.g. templates for Windows 8 snapped mode versus full-screen mode). With a bit of legwork, this reusability extends even across applications. For example, the bulk of our Windows Phone 8 templates are bound to the same view models we created for the Windows 8 app.

If you’re interested in learning more about the MVVM pattern and how it helped drive the development of our Windows 8 and Windows Phone 8 apps, check out this video from Build 2013.

XAML Control Template Overrides

Another perk of working with XAML-based frameworks is that you can extract and override the standard XAML templates used by the built-in controls.

For example, if you want a Windows Phone 8 LongListSelector to have some extra padding at the bottom of the list (but inside the scrollable region), just override the template and add the padding to the inner ViewportControl:

<Style x:Key="LongListSelectorWithBottomPadding" 
  TargetType="fcontrols:HLongListSelector">
    <Setter Property="Template">
        <Setter.Value>
            <ControlTemplate TargetType="fcontrols:HLongListSelector">
                <Grid Background="{TemplateBinding Background}">
                    <VisualStateManager.VisualStateGroups>
                        <VisualStateGroup x:Name="ScrollStates">
                            <VisualStateGroup.Transitions>
                                <VisualTransition GeneratedDuration="00:00:00.5"/>
                            </VisualStateGroup.Transitions>
                            <VisualState x:Name="Scrolling">
                                <Storyboard>
                                    <DoubleAnimation Duration="0" To="1" 
                                      Storyboard.TargetProperty="Opacity" 
                                      Storyboard.TargetName="VerticalScrollBar"/>
                                </Storyboard>
                            </VisualState>
                            <VisualState x:Name="NotScrolling"/>
                        </VisualStateGroup>
                    </VisualStateManager.VisualStateGroups>
                    <Grid Margin="{TemplateBinding Padding}">
                        <Grid.ColumnDefinitions>
                            <ColumnDefinition Width="*"/>
                            <ColumnDefinition Width="auto"/>
                        </Grid.ColumnDefinitions>
                      <!-- Note the bottom padding being set to 80 -->
                        <ViewportControl x:Name="ViewportControl" 
                          Padding="0,0,0,80"
                          HorizontalContentAlignment="Stretch" 
                          VerticalAlignment="Top"/>
                        <ScrollBar x:Name="VerticalScrollBar" 
                          Grid.Column="1" Margin="4,0,4,0" 
                          Opacity="0" Orientation="Vertical"/>
                    </Grid>
                </Grid>
            </ControlTemplate>
        </Setter.Value>
    </Setter>
</Style>

The template itself is large and might look like a lot of gibberish due to all the built-in styling and transitions. However, we didn’t have to write all of that. You can find a built-in control’s default XAML using Expression Blend (just right-click on the component and choose Edit Template -> Edit a Copy). From there, we simply copy the default XAML and make our tweaks and changes as needed.

This separation of UI from behavior, found in all the system controls, ends up providing a lot of power when you start approaching the limits of the built-in controls and want to extend your UI to do things outside of the original design. For example, if you want to use the FlipView component on Windows 8 but don’t like the appearance of the previous/next buttons, override the FlipView template and provide your own buttons (see how this works for our masthead). The logic of the control (animations, multitouch gestures, item recycling, etc.) all stays the same.

Zachary is a software developer on the mobile team at Hulu, and has worked on our Android, Windows 8, and Windows Phone apps.


Bank: an open source Statsd/Metricsd aggregation frontend

July 9th, 2013 by Feng Qi

At Hulu we use Statsd and Metricsd extensively. Metrics from our high-throughput services are continuously sent to Statsd/Metricsd clusters and are eventually surfaced via Graphite. Humans or machines can then take advantage of the Graphite UI or API to set up customized monitoring and alerting systems.

While we enjoyed the connectionless packet transmission enabled by the UDP ports of Statsd and Metricsd, we found that the metric data carried inside each UDP packet is proportionally small. For example, to report that the API “hulu_api” has processed one request in 30 milliseconds, we send this message to Statsd: “hulu_api:30|ms”. Consider, for a moment, how this looks at the level of a UDP packet on top of IP. With a UDP header of 8 bytes and an IP header of 20 bytes, the 14-byte metric payload accounts for only 33% of the bytes transferred. Aggregating many such messages into a single packet increases this percentage: in the previous example, if we combine many hulu_api metrics and send 1000 bytes of metric data in one packet, the data portion accounts for 97% of the bytes transferred. This is a great improvement. How much such aggregation helps depends on the Maximum Transmission Unit (MTU) of the specific network path; sending a packet beyond that length causes fragmentation into multiple data frames, which reintroduces header overhead.

Along with the advantage of reduced transmission overhead, packet aggregation comes with a few disadvantages. First, it reduces the temporal resolution of time-series data, because packets within a time window are sent together. Second, the packet loss rate may increase due to checksum failures: longer packets have a higher probability of data corruption, and any checksum failure causes the receiver to discard the entire packet. Third, sending larger packets takes more time and hence increases latency. To mitigate the first disadvantage, we can use a time threshold: when the threshold is reached, we send the aggregated packet even if aggregation has not reached its maximum capacity. The second and third disadvantages cannot be compensated for easily, but it is usually reasonable to assume that a network remains accurate and fast when packet sizes are close to, but below, the MTU.

When we have a target temporal resolution, we can perform some simple math-based data aggregations within the time window:

  1. Many packets of “Count” data type can be aggregated into one packet with their values summed: “hulu_api_success:10|c” and “hulu_api_success:90|c” can be aggregated to “hulu_api_success:100|c”
  2. Many packets of “Gauge” data type can be aggregated into one packet with the one latest value: “hulu_api_queue_size:30|g” and “hulu_api_queue_size:25|g” can be aggregated to “hulu_api_queue_size:25|g”
  3. Other data types are difficult to aggregate mathematically without re-implementing functionality of Statsd or Metricsd.

Implementing math-based aggregation is actually not trivial: it requires parsing incoming UDP data, organizing the metrics by name and data type, and performing the math operations. The benefit is that it reduces the actual amount of metric information you send; in a scenario where count or gauge data points are sampled at high frequency, hundreds of messages can be reduced to one!
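
A rough Python sketch of this window-based aggregation; our first Bank prototype was written in Python, but this is not its actual code.

# Sketch of math aggregation for count and gauge metrics within one time window.
def aggregate(lines):
    """lines: raw Statsd messages, e.g. 'hulu_api_success:10|c', collected
    within one window. Returns one aggregated line per count/gauge metric."""
    counts, gauges, passthrough = {}, {}, []
    for line in lines:
        name, rest = line.split(":", 1)
        value, mtype = rest.split("|", 1)
        if mtype == "c":                        # counts: sum the values
            counts[name] = counts.get(name, 0) + int(value)
        elif mtype == "g":                      # gauges: keep only the latest value
            gauges[name] = value
        else:                                   # timers and other types: forward as-is
            passthrough.append(line)
    out = ["%s:%d|c" % (n, v) for n, v in counts.items()]
    out += ["%s:%s|g" % (n, v) for n, v in gauges.items()]
    return out + passthrough

print(aggregate(["hulu_api_success:10|c", "hulu_api_success:90|c",
                 "hulu_api_queue_size:30|g", "hulu_api_queue_size:25|g",
                 "hulu_api:30|ms"]))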

One could implement the above aggregation logic within each sender. Instead of spreading this complexity throughout our applications, we decided to build an independent daemon. A standalone daemon has all the advantages of modularization: no matter what language a service is implemented in, and no matter how many different services one machine hosts, the network admin just needs to bring up one daemon process to aggregate all the packets sent from that host.

We decided to call the project “Bank” because we save penny packets and hope the “withdrawals” will be large. Here are some interesting details of Bank:

  1. Bank sends an aggregated packet when one of the three criteria is met:
    • a configurable max packet length is reached
    • a configurable number of packets have been aggregated
    • a configurable time interval has elapsed
  2. Bank can be used as the frontend of both Statsd and Metricsd because it respects both protocols. For example, it supports the Metricsd data type “meter”, which allows clients to omit “|m”. Also it does not send the Statsd style multi-metric packet, which is not understood by Metricsd.
  3. Bank respects “delete”. A packet with data “hulu_api:delete|h” will cause Bank to discard the current histogram data for hulu_api and send “hulu_api:delete|h” downstream.
  4. Bank can parse packets that are already aggregated. Sometimes a smart upstream does some aggregation by itself, and math aggregation can be carried out at multiple levels, so it is nice to be able to parse already-aggregated packets.

When our Seattle team tried Bank against their Statsd cluster, they decided to add downstream consistent-hashing logic to it. To work with a cluster of Statsd machines, the same metric from different frontends should be sent to the same Statsd instance, because that instance needs all the data points to compute its statistics. Consistent hashing is a perfect fit for this scenario: each Bank uses the TCP health check of Statsd to determine which downstream Statsd instances are up, and distributes packets based on metric names.
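
A hedged sketch of that routing idea: hash each metric name onto a ring of healthy downstream instances so the same metric always lands on the same Statsd. This is illustrative Python, not Bank's C implementation, and the node addresses are placeholders.

# Consistent hashing of metric names to downstream Statsd instances (sketch).
import bisect
import hashlib

def _hash(key):
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

class Ring(object):
    def __init__(self, nodes, replicas=100):
        # Place several virtual points per node so load spreads evenly.
        self._points = sorted((_hash("%s#%d" % (node, i)), node)
                              for node in nodes for i in range(replicas))
        self._keys = [point for point, _ in self._points]

    def node_for(self, metric_name):
        # Walk clockwise to the first virtual point at or after the metric's hash.
        i = bisect.bisect(self._keys, _hash(metric_name)) % len(self._points)
        return self._points[i][1]

ring = Ring(["statsd-1:8125", "statsd-2:8125", "statsd-3:8125"])
print(ring.node_for("hulu_api"))   # the same metric name always maps to the same node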

We implemented the initial version of Bank in Python over a weekend; indeed, most of the time was spent parsing and assembling strings to comply with the Statsd and Metricsd protocols. That version proved not to be as performant as we wanted, so we rewrote Bank in C. The C version of Bank uses only standard libraries, so it should be portable across most platforms.

You can find the code at https://github.com/hulu/bank. Please feel free to file issues or suggest patches if you end up using Bank.

Feng is a software developer on the Core Services team.