Hulu Tech Blog

Voidbox – Docker on YARN

August 6th, 2015 by Huahui Yang

1. Voidbox Motivation

YARN is the distributed resource management system in Hadoop 2.0; it schedules cluster resources for diverse high-level frameworks such as MapReduce and Spark. However, every existing framework on top of YARN is designed around assumptions about a specific system environment. How to support user applications with arbitrarily complex environment dependencies remains an open question. Docker gives the answer.

Docker is a very popular container virtualization technology and an open platform for developing, shipping, and running applications. It provides a way to run almost any application isolated in a container, automating deployment as a lightweight, portable, self-sufficient unit that runs virtually anywhere.

In order to integrate the unique advantages of Docker and YARN, the Hulu engineering team developed Voidbox. Voidbox enables any application encapsulated in a Docker image to run on a YARN cluster alongside MapReduce and Spark jobs. Voidbox brings the following benefits:

  • Easier creation of distributed applications
    • Voidbox handles the most common concerns of a distributed computing system: cluster discovery, elastic resource allocation, task coordination and disaster recovery. With its well-designed interface, implementing a distributed application is easy.
  • Simplified deployment
    • Without Voidbox, an application with a complex environment needs a dedicated VM that must be created and maintained, even though the VM image is huge and hard to deploy. With Voidbox, resources are allocated and the application runs right when we need it, and the additional maintenance work is eliminated.
  • Improved cluster efficiency
    • Because Spark/MapReduce jobs and all kinds of Voidbox applications from different departments can be deployed together, cluster usage is maximized.

Thus, YARN's position as a big-data operating platform is further consolidated and enhanced.

Voidbox supports the execution of Docker container-based DAG (Directed Acyclic Graph) tasks. Moreover, Voidbox provides several ways to submit applications, covering the needs of both the production environment and the debugging environment. In addition, Voidbox can cooperate with Jenkins, GitLab and a private Docker Registry to set up a complete development, testing and automated release process.

2. Voidbox Architecture

2.1 YARN Architecture Overview

YARN enables multiple applications to share resources dynamically in a cluster. Here is the architecture of applications running in a YARN cluster:


Figure 1. YARN Architecture

As shown in figure 1, a client submits a job to the Resource Manager, which performs its scheduling function according to the resource requirements of the application. The Application Master is responsible for scheduling the application's tasks and managing the application's lifecycle.

Functionality of each module:

  • Resource Manager: responsible for resource management and scheduling in the cluster.
  • NodeManager: runs on each compute node in the cluster, takes care of task execution on that machine, collects information, and keeps a heartbeat with the Resource Manager.
  • Application Master: requests resources from YARN, then allocates those resources to run tasks in Containers.
  • Container: an abstract notion incorporating resources such as memory, CPU, disk and network.
  • HDFS: the distributed file system in the YARN cluster.

2.2 Voidbox Architecture Design

In the Voidbox architecture, YARN is responsible for the cluster's resource management, while Docker acts as the task execution engine on top of the operating system, cooperating with the Docker Registry. Voidbox translates user code into Docker container-based DAG tasks, applies for resources according to their requirements, and executes the DAG.


Figure 2. Voidbox Architecture

As shown in figure 2, each box stands for one machine with several modules running inside. To make the architecture clearer, we divide the modules into three parts. The functionality of the Voidbox modules and Docker modules is as follows:

  • Voidbox Modules:
    • Voidbox Client: the client program. Through Voidbox Client, users can submit a Voidbox application, stop it, and so on. A Voidbox application contains several Docker jobs, and a Docker job contains one or more Docker tasks.
    • Voidbox Master: the application master in YARN terms; it requests resources from YARN, then allocates them to Docker tasks.
    • Voidbox Driver: responsible for task scheduling within a single Voidbox application. Voidbox supports Docker container-based DAG task scheduling, and user code can be inserted between tasks, so Voidbox Driver handles both the ordering of DAG task dependencies and the execution of the user's code.
    • Voidbox Proxy: the bridge between YARN and the Docker engine, responsible for relaying commands from YARN to the Docker engine, such as starting or killing a Docker container.
    • State Server: maintains information about each Docker engine's health status and provides the list of machines that can run Docker containers, so Voidbox Master can apply for resources more efficiently.
  • Docker Modules:
    • Docker Registry: stores Docker images, acting as an internal version control tool for them.
    • Docker Engine: the Docker container execution engine; it obtains the specified Docker image from the Docker Registry and launches Docker containers.
    • Jenkins: cooperates with GitLab; when the application code is updated, Jenkins takes care of automated testing, packaging, generating the Docker image and uploading it to the Docker Registry, completing the automated release process.

2.3 Running Mode

Voidbox provides two application running modes: yarn-cluster mode and yarn-client mode.

In yarn-cluster mode, both the control component and the resource management component run in the YARN cluster. After the Voidbox application is submitted, Voidbox Client can quit at any time without affecting the running application. This mode is intended for the production environment.

In yarn-client mode, the control component runs in Voidbox Client, while the other components run in the cluster. Users can see much more detailed logs about the application's status, but when Voidbox Client quits, the application in the cluster exits too. This makes it more convenient for debugging.

Here we briefly introduce the implementation architecture of the two modes:

  • yarn-cluster mode


Figure 3. yarn-cluster mode

As shown in figure 3, Voidbox Master and Voidbox Driver both run in the cluster: Voidbox Driver controls the application logic, and Voidbox Master takes care of the application's resource management.

  • yarn-client mode


Figure 4. yarn-client mode

As shown in figure 4, Voidbox Master runs in the cluster, while Voidbox Driver runs in Voidbox Client, so users can submit a Voidbox application from the IDE for debugging.

2.4 Running Procedure

Here is the procedure for submitting a Voidbox application, and its lifecycle:

  1. Users write a Voidbox application with the Voidbox SDK and generate a Java archive, then submit it to the YARN cluster through Voidbox Client.
  2. After receiving the Voidbox application, the Resource Manager allocates resources for Voidbox Master, then launches it.
  3. Voidbox Master starts Voidbox Driver, which decomposes the Voidbox application into several Docker jobs (a job contains one or more Docker tasks). Voidbox Driver calls the Voidbox Master interface to launch the Docker tasks on compute nodes.
  4. Voidbox Master requests resources from the Resource Manager, which allocates YARN containers according to the cluster status. Voidbox Master launches Voidbox Proxy in each YARN container, and the proxy communicates with the Docker engine to start the Docker container.
  5. The user's Docker task runs in the Docker container, with its log written to a local file. Users can see real-time application logs through the YARN Web Portal.
  6. After all Docker tasks are done, the logs are aggregated to HDFS, so users can still get the application logs from the history server.

2.5 Docker integration with YARN resource management

YARN acts as the uniform resource manager of the cluster and is responsible for resource management on all machines. Docker, as a container engine, also has resource management functions, so how to integrate the two resource managers is particularly important.

In YARN, user tasks can only run in YARN containers, while Docker containers can only be handled by the Docker engine. Docker containers would therefore escape YARN's management, violating YARN's principle of unified management and scheduling and creating a risk of resource leaks. To let YARN manage and schedule Docker containers, we built a proxy layer between YARN and the Docker engine: this is why Voidbox Proxy was introduced. Through Voidbox Proxy, YARN can manage the Docker container lifecycle, including start, stop, etc.

To understand Voidbox Proxy more clearly, take stopping a Voidbox application as an example. When a user kills a Voidbox application, YARN recycles all the application's resources by sending a kill signal to the related machines. The corresponding Voidbox Proxy catches the kill signal and stops the Docker container in the Docker engine to recycle its resources. So, with the help of Voidbox Proxy, not only the YARN container but also the Docker container is stopped, avoiding the resource leak issue (this problem exists in the open source version; see YARN-1964).
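As a rough illustration of the proxy idea (not the actual Voidbox Proxy implementation, whose protocol and container identifiers are not documented here), a proxy process can trap the kill signal and issue a `docker stop` before exiting:

```python
import signal
import subprocess
import sys

def make_sigterm_handler(container_id):
    """Return a handler that stops the Docker container before exiting.

    container_id is a hypothetical identifier handed to the proxy at
    launch; the real Voidbox Proxy interface may look quite different.
    """
    def handler(signum, frame):
        # Translate YARN's kill signal into a `docker stop`, so the
        # Docker container is recycled together with the YARN container.
        subprocess.call(["docker", "stop", container_id])
        sys.exit(0)
    return handler

# Register the handler so a YARN-initiated SIGTERM stops the container.
signal.signal(signal.SIGTERM, make_sigterm_handler("my-task-container"))
```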

3. Fault Tolerance

Although Docker has stable releases, enterprise production environments run a variety of operating system and kernel versions, which introduces instability. Voidbox's fault-tolerant design therefore covers multiple levels to ensure high availability.

  • Voidbox Master fault tolerance
    • If the Resource Manager finds that Voidbox Master has crashed, it notifies the NodeManagers to recycle all the YARN containers belonging to this Voidbox application, then restarts Voidbox Master.
  • Voidbox Proxy fault tolerance
    • If Voidbox Master finds that Voidbox Proxy has crashed, it recycles the Docker containers on behalf of Voidbox Proxy.
  • Docker container fault tolerance
    • Each Voidbox application can configure a maximum number of retries on failure. When a Docker container crashes, Voidbox Master decides what to do according to the container's exit code.
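A minimal sketch of what such an exit-code-driven retry policy could look like (the policy details here are assumptions for illustration, not Voidbox's actual rules):

```python
def should_retry(exit_code, attempts, max_retries):
    """Decide whether to relaunch a failed Docker task.

    A plausible policy: exit code 0 means success; any non-zero code is
    retried until the configured maximum number of retries is reached.
    """
    if exit_code == 0:
        return False               # task succeeded, nothing to retry
    return attempts < max_retries  # retry budget not yet exhausted

# A task that keeps failing (exit code 1) is retried exactly twice
# when max_retries=2.
attempts = 0
while should_retry(1, attempts, max_retries=2):
    attempts += 1
```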

4. Programming model

4.1 DAG Programming model

Voidbox provides a Docker container-based DAG programming model. A sample would look similar to this:


Figure 5. Docker container-based DAG programming model

As shown in figure 5, there are four jobs in this Voidbox application, and each job can configure its CPU, memory, Docker image, parallelism and other requirements. Job3 starts when job1 and job2 have both completed. Job1, job2 and job3 make up a stage, so users can insert their own code after this stage is done, before finally running job4.
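The stage-by-stage execution described above can be sketched in Python (a toy stand-in for the real Java SDK; the job names and runner are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

def run_stage(jobs, runner):
    """Run all jobs of one stage in parallel and wait for completion."""
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        return list(pool.map(runner, jobs))

# Mirrors Figure 5: job1 and job2 run in parallel, job3 waits for both,
# user code runs after the stage, and job4 starts last.
done = []
runner = done.append  # stand-in for launching a Docker task

run_stage(["job1", "job2"], runner)
run_stage(["job3"], runner)
# ... user code inserted between stages would run here ...
run_stage(["job4"], runner)
```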

4.2 Shell mode to submit one task

In many cases we would like to run a single Docker container-based task without programming, so Voidbox supports a shell mode for describing and submitting such a task; it is actually an implementation based on the DAG programming model. An example usage of Voidbox in shell mode:

docker-submit.sh \
  -docker_image centos \
  -shell_command "echo Hello Voidbox" \
  -container_memory 1000 \
  -cpu_shares 2

The command above submits a task that runs "echo Hello Voidbox" in a Docker image named 'centos', requesting 1000 MB of memory and 2 virtual CPU cores.

5. Voidbox in Action

At present we can run Docker, MapReduce, Spark and other applications in the YARN cluster, and many short-running tasks within Hulu already use Voidbox.

  • Automated testing and release
    • Cooperating with Jenkins, GitLab and the private Docker Registry: when the application code is updated, Jenkins runs the automated tests, packages the program, regenerates the Docker image and pushes it to the private Docker Registry, completing the development, testing and automated release process.
  • Complex tasks in parallel
    • A test framework is used to run tests that check the availability of certain components. The project is implemented in Ruby/Java and has complex dependencies, so we maintain two layers of Docker images: the first layer is the system software, as a base image, and the second layer is the business level. We publish a test framework Docker image and use a scheduling tool to start the Voidbox application regularly. Thanks to Voidbox, we solved both the complex dependency and multitask parallelism issues.
    • Face Match (http://tech.hulu.com/blog/2014/05/03/face-match-system-overview/) is a video analysis application. It is implemented in C and depends on many graphics libraries, which makes it a good fit for Voidbox: first we package the whole Face Match program into a Docker image, then we write a Voidbox application to process multiple videos. Voidbox solves both the complex machine environment and the parallelism control problem.
  • Building complex workflows
    • Some tasks depend on each other; for example, user behaviors must be loaded before they can be analyzed, so the two steps have a sequential dependency. We use the Voidbox container-based programming model to handle this case easily.

6. Differences from DockerContainerExecutor in YARN 2.6.0

  • DockerContainerExecutor (https://issues.apache.org/jira/browse/YARN-1964), released in YARN 2.6.0, is an alpha version: it is not mature enough, and it is only an encapsulation layer above the default executor.
  • DockerContainerExecutor is difficult to run alongside other ContainerExecutors in one YARN cluster.
  • Voidbox features
    • DAG programming model
    • Configurable container level of fault tolerance
    • A variety of running modes, considering development environment and production environment
    • Shares YARN cluster resources with other Hadoop jobs
    • Graphical log view tool

7. Future work

  • Support more versions of YARN
    • Voidbox currently targets YARN 2.6.0 and would like to support more versions in the future.
  • Voidbox Master fault tolerance: persist metadata to reduce the cost of retries
    • Currently, if Voidbox Master crashes, YARN recycles the resources belonging to the Voidbox application and restarts Voidbox Master, which redoes the tasks from the very beginning. Tasks that are already done or still running should not have to be affected, so we might keep some metadata in the State Server to reduce the cost of a Voidbox Master failure.
  • Voidbox Master as a permanent service
    • Voidbox will support a long-running Voidbox Master that receives streaming tasks.
  • Support long-running services
    • Voidbox will support long-running services once Voidbox Master's downtime no longer influences running tasks.

You Can Now Use the Apple Watch as a Hulu Remote

July 15th, 2015 by admin

Today, we are excited to announce that we’ve created a new Hulu application for the Apple Watch that brings some of the most important features of a remote control directly to your wrist and allows you to control your viewing experience with a few simple taps.

At Hulu, we are constantly trying to find new and innovative ways to make the viewing experience as seamless as possible for our viewers. The Hulu application on the Apple Watch is the perfect opportunity to explore the Apple Watch OS and experiment with ways to integrate the Hulu experience into the popularity of wearable platforms.

With the Hulu app, you will be able to play, pause and rewind your favorite shows on Apple TV, Chromecast, Xbox One, PS3 and PS4 with a tap on your wrist. You will also be able to toggle captions from the Hulu app for Apple Watch.

You will be able to connect directly to an existing Chromecast, Xbox One, PS3 or PS4 device that's streaming Hulu, and control it right when you launch the Hulu app on your Apple Watch.

If you watch on Apple TV, you will have to first launch a Hulu stream via Apple TV on your phone, and then you will be able to use your Apple Watch Hulu application as a remote.

Hulu for Apple Watch is now available in the Apple Watch app store. Stay tuned for more updates and feature additions.

The Hulu app for Apple Watch was implemented by our mobile team intern, Rahin Jegarajaratnam, who was mentored by Bradley Snyder along with the iOS dev team. Our intern program is unique in that we actively use it as an opportunity for interns to work on projects that directly touch consumers.

Aggregation of Relevance Tables with Expert Labeling

May 26th, 2015 by Heng Su

(by Wenkui Ding and Heng Su)


In Hulu we continuously seek ways to improve our users’ content discovery experiences by various recommendation techniques. One of the most important components supporting the content discovery products is the relevance table.

The relevance table can simply be regarded as a 2-dimensional real-valued matrix showing how "relevant" or "similar" every two pieces of content are. Ultimately we need one single relevance table to generate recommendation results such as autoplay videos or the top 10 recommended shows for our users. The problem is that, internally, instead of one single relevance table we get many sub-relevance tables from different sources; for example, we have relevance tables generated from our users' watch behaviors, from their search behaviors and from the content metadata in Hulu. We will introduce how we generate the final production relevance table by aggregating these sub-relevance tables with domain-expert labeling information.

Without loss of generality, to simplify the problem, let's assume all relevance tables are at the granularity of TV shows, i.e., each relevance table represents the relevance between TV shows.

The Workflow

The simplest and maybe most intuitive way to aggregate the sub-relevance tables is to manually evaluate the quality of each table and assign it a weight, then take a weighted linear combination of the tables to generate the result. However, this is apparently not good enough: first, the quality of a relevance table changes when it updates; second, such a global model cannot capture all the useful information in the tables; for example, some accurate relevance information will be neglected just because the overall quality of the table containing it is low. So we use a more sophisticated aggregation algorithm.
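The baseline described above can be sketched in a few lines of NumPy (the tables and weights here are made-up toy values):

```python
import numpy as np

# Three hypothetical sub-relevance tables (show-by-show matrices) from
# watch behavior, search behavior and metadata, plus manual weights.
tables = {
    "watch": np.array([[0.0, 0.9], [0.9, 0.0]]),
    "search": np.array([[0.0, 0.4], [0.4, 0.0]]),
    "metadata": np.array([[0.0, 0.6], [0.6, 0.0]]),
}
weights = {"watch": 0.5, "search": 0.2, "metadata": 0.3}

# The baseline: a fixed weighted linear combination of the sub-tables.
final = sum(weights[name] * tables[name] for name in tables)
```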

The entire workflow is shown below; the following sections describe the details of the process.

Fig 1. Relevance Table Aggregation Workflow

The “Libra” Front-end

First, we have a front-end named "Libra" to collect domain experts' label results. For each label, the expert is presented with three shows (referred to as a "show tuple" and denoted A, B and C) and must answer the question: "Which of shows B and C is more relevant to show A?" The answer can be B or C; when neither B nor C is relevant, or when it's hard to judge, the expert can also select "skip this tuple".

Note that the same show tuple can be labeled by different experts, or even by the same expert at different times, to test whether consensus is reached.

Instead of letting the experts directly assign the relevance value between two shows, we prefer the setup above because it is much easier for an expert to compare two relevance values than to decide on one.

The Machine-learning-based Relevance Aggregation

Second, a machine-learning algorithm that uses the label results as ground truth and the sub-relevance tables as features is introduced to generate the final combined relevance table for our various products.

The following objective function is defined:

where (k, i, j) is a labeling result showing that the expert prefers show i to show j for source show k; x_{k,i} is the vector containing the relevance values from show k to show i in all sub-relevance tables; f is the scoring function used to generate the final relevance table; and σ is a parameter controlling the "smoothness" of the objective function. The goal is to minimize the objective L and obtain the optimal function f.

We propose two ways to model f: linear combination and non-linear combination.

(1) Linear Combination

f is modeled as a linear function of the feature vector, f(x) = wᵀx, where w is the weight vector. The optimal w can be found by stochastic gradient descent (SGD), as the following algorithm shows.

Algorithm I

a. In each round of iteration, enumerate all labeling results, and for each label result (k, i, j):

  (i). Calculate the gradient of the loss function w.r.t. the weight vector.
  (ii). Update the weight vector along the negative gradient.

b. Repeat until the weight vector converges or the total loss falls below a threshold.
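Assuming a pairwise logistic loss log(1 + exp(−σ(f(x_ki) − f(x_kj)))) — one plausible reading of the smoothed objective, since the post's exact formula is not shown — Algorithm I can be sketched as:

```python
import numpy as np

def sgd_pairwise(labels, features, lr=0.1, sigma=1.0, rounds=200):
    """SGD for a linear scoring function f(x) = w.x on pairwise labels.

    labels: list of (k, i, j), meaning show i is preferred over show j
    for source show k. features[(k, i)] is the vector of relevance
    values from show k to show i across all sub-tables.
    """
    dim = len(next(iter(features.values())))
    w = np.zeros(dim)
    for _ in range(rounds):
        for (k, i, j) in labels:
            diff = features[(k, i)] - features[(k, j)]
            # Gradient of log(1 + exp(-sigma * w.diff)) w.r.t. w.
            grad = -sigma * diff / (1.0 + np.exp(sigma * w.dot(diff)))
            w -= lr * grad
    return w

# Toy data: the first sub-table agrees with the expert, the second does
# not, so the learned weight should favor the first sub-table.
features = {(0, 1): np.array([0.9, 0.1]), (0, 2): np.array([0.2, 0.8])}
w = sgd_pairwise([(0, 1, 2)], features)
```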


(2) Non-linear (Additive) Combination

f is modeled as a boosted regression tree:

where each sub-model is a one-level tree, i.e. a binary function with only 0 and 1 as output, generated by thresholding one of the sub-relevance tables, together with a corresponding multiplier.

The gradient boosting method is used to obtain the (sub-)optimal f, as the following algorithm describes.

Algorithm II

a. Initialize the model f.

b. In each round of iteration, do the following:

  (i). Enumerate all pairs of shows and, for each show pair, compute the pseudo-residuals for each item:

  (ii). Fit a one-level-tree sub-model to the pseudo-residuals.

  (iii). Compute the corresponding multiplier:

  (iv). Update the model:

c. Repeat the above process until the total loss is below a threshold.
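For illustration, here is a toy sketch of boosting one-level trees (decision stumps); it uses a plain squared loss instead of the post's pairwise objective, so it shows the mechanics of Algorithm II rather than the exact method:

```python
import numpy as np

def fit_stump(x, residuals):
    """Pick the (feature, threshold) stump that best fits the residuals.

    The stump outputs 1 where x[:, f] >= t and 0 elsewhere; the score is
    the squared-error reduction achievable with the optimal multiplier.
    """
    best = (0, 0.0, -np.inf)
    for f in range(x.shape[1]):
        for t in np.unique(x[:, f]):
            h = (x[:, f] >= t).astype(float)
            score = h.dot(residuals) ** 2 / max(h.sum(), 1.0)
            if score > best[2]:
                best = (f, t, score)
    return best[0], best[1]

# Toy regression target fitted by boosted one-level trees.
x = np.array([[0.1], [0.4], [0.7], [0.9]])
y = np.array([0.0, 0.0, 1.0, 1.0])
pred = np.zeros(len(y))
for _ in range(10):
    residuals = y - pred                          # pseudo-residuals (squared loss)
    f, t = fit_stump(x, residuals)
    h = (x[:, f] >= t).astype(float)
    gamma = h.dot(residuals) / max(h.sum(), 1.0)  # line-search multiplier
    pred += gamma * h                             # model update
```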



Significant improvements have been observed from experiments using the new aggregated relevance table, compared with the manual-weighted linear combination method. We have found:

  1. Around a 4.51% relative increase in eCTR (effective CTR) on the "YouMayAlsoLike" tray, and
  2. Around a 5.69% increase in watch minutes from the part of autoplay that is related to the relevance table.


The relevance table is an important part of the recommendation system in Hulu. Online experiments show significant improvements of the machine-learning-based relevance table aggregation over the fixed weighted combination, especially with the non-linear combination method. Furthermore, there are still other questions to answer in our recommendation engine, such as how to ensure diversity, adjust the relevance table with users' explicit feedback, and utilize context information.


Face Match System – Clustering, Recognition and Summary

May 4th, 2014 by Cailiang Liu

Following the workflow of the Face Match system, this blog entry introduces the third core technique: face track clustering and recognition.

Track tagging

When a series of face tracks have been extracted from a set of videos, the next step is to tag them automatically with some probable actor names from the given show. After all, manually processing all the tracks from scratch would be infeasible. The tags, with some sort of acceptable accuracy rate — let’s say 80 percent — provide valuable cues for a person to verify the tracks in groups. When presented in a user-friendly interface, the tags also improve the speed required to correct erroneous matches. Given that, we are seeking ways to improve tagging accuracy for face tracks. This naturally falls into the machine-learning framework, which is widely adopted in the computer vision research community. In this blog entry, we refer to this problem of automatically annotating faces (not tracks) as face tagging.

Traditionally, face verification technology tries to identify whether a given image belongs to a specific person in a set of candidates. Though successfully applied in controlled environments, the approach has strict assumptions: the video must have good lighting, the actors must be facing the camera, and their faces cannot be obscured. However, these assumptions do not hold in challenging environments such as TV shows or movies. The general research interest has recently turned toward uncontrolled datasets like "Labeled Faces in the Wild" (LFW), and the benchmark of identifying whether two faces belong to the same person has attracted a lot of attention. However, the LFW database contains many people with only two face samples each, so the benchmark can hardly cover the case of identifying many people, in many poses, in a truly wild environment.

In the machine-learning framework, the problem of track tagging essentially boils down to how to construct a proper track similarity function as a function of the similarity of the faces in the tracks. Because we are facing the largest dataset for face verification in research history, the time and labor for human verification have become the most critical metrics. Only by improving the accuracy of track tagging can we significantly reduce the time and labor for verification. There are a few aspects that impact the results: 1) The features of the toolset; 2) The learning approach; 3) The cold start problem. Still, because of the very large dataset, we are also constrained by the amount of processing time we can afford. Given the potential number of all videos available to Hulu, we need to reduce the processing time to less than one second per face image. Thus, we cannot afford the recent effective, yet heavy, methods such as dense local feature based methods. Next, we will elaborate on each of these aspects.

Features extraction

In the current system, we leverage quite a few kinds of visual information to improve the tagging accuracy. Compared with a single image, we are equipped with the temporal information provided by continuous face tracks. Fusing these tracks into a 3-D face model is an interesting alternative for us to explore in the future. For now, we’ve limited ourselves to select a few representative faces and have constructed the track similarity function as a function of the similarity of the representative faces.

First, we resize the image to ensure the face region is 80×80 pixels. Then we enlarge the selected region to 80×160 pixels by extending 40 pixels up and 40 pixels down, respectively. See Figure 1 for an example.

Standard face features such as global face features and LBP (local binary pattern) facial features are extracted on the global region and local regions respectively. The global face feature is extracted on the aligned faces with a 16×16 grid layout, with each grid containing a 58-dim LBP feature. The LBP facial features are extracted on each local facial window with a 4×4 grid layout and a 58-dim histogram is accumulated for each grid by different LBP code. These local histograms are concatenated into a vector of 928 dims.
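As a sketch of the grid-of-LBP-histograms idea (using the plain 256-code LBP rather than the 58-dim uniform-pattern variant the post uses):

```python
import numpy as np

def lbp_codes(img):
    """Basic 8-neighbor LBP codes for the interior pixels of a 2-D image.

    This is the plain 256-code variant; the post's 58-dim histograms use
    the 'uniform' subset of these codes, which is omitted here.
    """
    c = img[1:-1, 1:-1]
    neighbors = [img[0:-2, 0:-2], img[0:-2, 1:-1], img[0:-2, 2:],
                 img[1:-1, 2:], img[2:, 2:], img[2:, 1:-1],
                 img[2:, 0:-2], img[1:-1, 0:-2]]
    code = np.zeros_like(c, dtype=np.uint8)
    for bit, n in enumerate(neighbors):
        code |= (n >= c).astype(np.uint8) << bit
    return code

def grid_histograms(codes, rows, cols):
    """Concatenate per-cell LBP histograms over a rows x cols grid."""
    h, w = codes.shape
    feats = []
    for r in range(rows):
        for cc in range(cols):
            cell = codes[r*h//rows:(r+1)*h//rows, cc*w//cols:(cc+1)*w//cols]
            feats.append(np.bincount(cell.ravel(), minlength=256))
    return np.concatenate(feats)

# An 80x80 face region yields 78x78 interior codes and a 4x4 grid of
# 256-bin histograms.
img = np.random.default_rng(0).integers(0, 256, size=(80, 80))
feat = grid_histograms(lbp_codes(img), 4, 4)
```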

A few face verification approaches require face alignment and face warp as a preprocessing step. The alignment process identifies landmark points in the face, e.g. the corners of the eye, the mouth and the nose. Then the face can be warped to the frontal position by triangulating the facial features and finding the affine mapping. Therefore, global face features can be extracted on distorted faces as well. However, in our experiments, we did not see much improvement using this step. This may be due to the fragility of the alignment algorithm we used.

We assume that the given character’s appearance will not change often in one video. So we further incorporate a few other features to reflect the character’s appearance, including hair and face, as well as the environment in which he or she appears. More specifically, we extract texture and color features in respective areas of the face image to reflect hair and scenery. The LBP feature is also extracted on the full 80×160 region to represent the face as a whole. The importance weights among different modalities are learned afterward with some label information for face tracks.


Figure 1. Feature extraction for face tracks

Learning approach

The primary goal in this step is to construct a proper track similarity function as a function of the similarity of the underlying faces across the tracks.

Given a new video, the tracks of an actor will usually be more similar to tracks of the same actor from this specific video than to tracks of the same actor from other videos, because the actor's appearance mostly stays the same within one video. Thus, label information from the current video is more valuable than that from other videos. With these labels we can expect higher tagging accuracy, so we adopt an online learning scheme that incorporates newly verified track labels from the current video at the earliest opportunity.

As we need to handle several tens of thousands of actors in our system, building and maintaining a supervised model for every possible actor is infeasible, even though we only need to deal with 100 to 500 actors for a given show. Given online learning and the huge number of candidates, we adopt a k-Nearest Neighbor (kNN) based lazy-learning approach: we annotate the faces independently and then vote among the face tags to determine the tag for the whole track. The merit of such lazy learning is that we do not need to maintain any learned model, and newly acquired labels can be added instantly. As shown in Figure 2, after feature extraction, an approximate kNN scheme is used to speed up the neighbor-finding process. For a face track X, the jth feature of the ith face in X is denoted Xij, and its nearest samples are denoted S1, S2, and so on, where S1 is the most similar neighbor and S2 the second. Each face is represented by a linear combination of its nearest neighbors, and the weight for each neighbor is the similarity of the target face to that neighbor. The L2 norm is used in the similarity metric because the L1 norm performs worse and is far less efficient:
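A minimal sketch of similarity-weighted kNN voting (the similarity function and toy data are illustrative, not the system's actual ones):

```python
import numpy as np
from collections import defaultdict

def knn_vote(query, database, k=3):
    """Similarity-weighted kNN vote over actor identities.

    database is a list of (feature vector, actor id). Each neighbor
    votes for its actor with a weight derived from an L2 distance,
    matching the post's choice of L2 over L1.
    """
    dists = [(np.linalg.norm(query - f), actor) for f, actor in database]
    dists.sort(key=lambda p: p[0])
    votes = defaultdict(float)
    for d, actor in dists[:k]:
        votes[actor] += 1.0 / (1.0 + d)  # one simple L2-based similarity
    return max(votes, key=votes.get)

db = [(np.array([0.0, 0.0]), "actor_a"),
      (np.array([0.1, 0.0]), "actor_a"),
      (np.array([1.0, 1.0]), "actor_b")]
label = knn_vote(np.array([0.05, 0.0]), db)
```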


We treat faces with different poses in the same way, since the database is large enough that faces will find neighbor faces with the same pose. With the coefficients bij, we can generate a voting distribution over the identity list aij:


To measure the reliability of the voting, we use the sparse concentration index as the confidence score:


To fuse the neighbor votes into labels for the samples Xij, we apply a weighting function to the voting distribution. The function magnifies votes with large confidence scores, and its parameters vjk need to be learned; in effect, when the confidence score is not high, the vote weight is lowered.
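The sparse concentration index mentioned above can be computed as follows; this uses the usual definition from the sparse-representation literature, which may differ in detail from the post's formula:

```python
import numpy as np

def sparse_concentration_index(votes):
    """Sparse concentration index (SCI) of a per-identity vote distribution.

    Returns 1.0 when all vote mass falls on a single identity and 0.0
    when the mass is spread uniformly over all k identities.
    """
    k = len(votes)
    total = votes.sum()
    return (k * votes.max() / total - 1.0) / (k - 1.0)

peaked = sparse_concentration_index(np.array([1.0, 0.0, 0.0]))   # fully confident
uniform = sparse_concentration_index(np.array([1.0, 1.0, 1.0]))  # no preference
```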

Learning voting weights for features with structured output SVM

The standard structured output SVM primal formulation is given as follows:


The voting weight w is the stack of the vectors vj. To learn w, we define a joint feature map in which an indicator vector (with a 1 only in the y-th row) selects the features for class y, and a feature function maps a track to a matrix of confidences for the different identities:


Learning a structured output SVM with the joint feature map defined above results in weight vectors that best combine the multi-view features for face track recognition. To vote the identity label for a track X, we use the following formula:


Fusing samples for the track label

One simple way to fuse the different samples Xi is to use all of their identity distributions in the computation. However, this produces mismatches, because many samples are very similar and they all match faces with the wrong identities. To avoid these mismatches, we adopt a diversity-sampling algorithm, GRASSHOPPER, to select diverse samples, and define the following similarity function for it:


where each term is the most similar neighbor of the corresponding sample.

Finally the label of the face track X is computed using the formula:


Experiments show that, with sufficiently large face databases, the precision of automatic track tagging would be as high as 95 percent when annotating 80 percent of the face tracks. For some high-quality episodes, the system is able to annotate 90 percent of face tracks with 95 percent accuracy. This significantly reduces the time required for manual confirmation.

After automatic tagging, the face tracks are clustered with respect to visual similarity and presented to human annotators for verification. The corrected labels are fed back into the system to further improve the tagging accuracy.

Cold start

The cold start phenomenon is frequently discussed in the research for recommendation systems. Due to lack of information for a newcomer to the system, no cue is available for deciding which items to recommend. Similarly, when a new show or a new actor comes into our system, we have no labeled information, and thus supervised learning is not feasible. In such a situation, we resort to unsupervised/semi-supervised learning approaches to provide the initial labels for a few tracks to the system.

Simple unsupervised hierarchical clustering is possible, but we can do better than that. Though we do not have label information for a new show or a new actor, we do have labels for other actors in other shows. Thus, with a few pre-built classifiers for each of the known actors, we construct a similarity vector to measure the similarities of the current track to the given set of known actors. See Figure 2 for details, where the small graph illustrates an example of one track’s classification scores against a list of known actors. Arguably, this similarity vector encodes some prior knowledge in the system, so we expect this semi-supervised learning scheme to outperform the unsupervised scheme. Experimental results show that the semi-supervised scheme improves the purity score of the clusters by 30 percent over the unsupervised scheme.


Figure 2. Computing track similarities (with respect to known actors) for face track clustering
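The similarity-vector idea can be sketched in a few lines. The features and linear classifiers below are toy, hypothetical stand-ins for the pre-built actor models, purely to show why two tracks of the same unknown person end up close in similarity-vector space:

```python
import numpy as np

def similarity_vector(track_feature, known_classifiers):
    """Score a track against classifiers pre-built for actors known from
    other shows; the resulting score vector serves as a clustering feature."""
    return np.array([clf(track_feature) for clf in known_classifiers])

# hypothetical linear classifiers for three known actors (toy 3-D features)
known_actors = [np.array([1.0, 0.0, 0.0]),
                np.array([0.0, 1.0, 0.0]),
                np.array([0.0, 0.0, 1.0])]
classifiers = [lambda x, w=w: float(w @ x) for w in known_actors]

# two tracks of the same *unknown* actor, who merely resembles known actor 0
track_a = np.array([0.9, 0.1, 0.0])
track_b = np.array([0.8, 0.2, 0.1])
vec_a = similarity_vector(track_a, classifiers)
vec_b = similarity_vector(track_b, classifiers)
# the two tracks get similar score vectors, so clustering on these vectors
# groups them together even though no classifier exists for their identity
```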

Lessons learned

  • Combining face features and context features for hair and clothes improves annotation accuracy.
  • The online active learning scheme shows better results than offline ones.
  • Confirmation is an easier and faster task than annotation for humans. More accurate prediction results help a lot in reducing confirmation time.
  • Grouping visually similar tracks together for confirmation lightens manual workload and significantly reduces human reaction time.
  • The semi-supervised scheme helps solve the cold start problem, and therefore helps annotation.

Our exploration is a preliminary investigation of the track-tagging problem. This is an interesting open research problem and we will continue to improve the annotation accuracy.

This is the 4th blog of the Face Match tech blog series. You can browse the other 3 blogs in this series by visiting:

1. Face Match System – Overview

2. Face Match System – Face Detection

3. Face Match System – Shot Boundary Detection and Face Tracking


Face Match System – Shot Boundary Detection and Face Tracking

May 3rd, 2014 by Tao Xiong

Following the workflow of the Face Match system, this blog entry introduces the second core technique: face tracking with shot boundary detection.

Shot boundary detection

What is shot boundary detection?

A video is usually composed of hundreds of shots strung into a single file. A shot is composed of continuous frames captured in one camera action. Shot boundary detection is used to locate the accurate boundary between two adjacent shots. There are several kinds of boundaries between adjacent shots, but they can generally be categorized into two types: abrupt transition (CUT) and gradual transition (GT). CUT is usually easy to detect since the change at the boundary is large. Considering the characteristics of different editing effects, GT can be further divided into dissolve, wipe, fade out/in (FOI), and so forth. For GT, there is a smooth transition from one shot to another, which makes it more difficult to determine the position of the boundary. Additionally, it can be difficult to tell the difference between a GT and fast movement within a single shot, since the variation of content is smooth in both cases.

Why is shot boundary detection needed?

Shot boundary detection is widely useful in video processing. It is a preliminary technique that can help us to divide a long and complex video into relatively short and simple segments.

In Face Match, shots are the basic units for face tracking as they provide an effective tool to restrict a face track that may drift across multiple shots.

How do you achieve shot boundary detection?

Three steps are required for shot boundary detection:

1. Extract features to represent the video content.

To find the shot boundary, the video is analyzed frame-by-frame. The color vector composed of color values of all pixels in a frame is not good enough to determine a shot change since it’s very sensitive to movement and illumination. Therefore, histogram features for both colors in the HSV color space and textures with the local binary pattern (LBP) descriptor are extracted. LBP reflects a local geometric structure and is less sensitive to variations in global illumination.
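An LBP histogram can be computed with plain NumPy, as sketched below. The post uses the 58-bin uniform-pattern variant; for brevity this sketch keeps all 256 raw codes. The example also demonstrates the property relied on above: a global brightness shift leaves the histogram unchanged.

```python
import numpy as np

def lbp_histogram(gray):
    """8-neighbor local binary pattern histogram: each pixel is encoded by
    which of its neighbors are at least as bright, capturing local
    geometric structure independently of global illumination."""
    c = gray[1:-1, 1:-1]
    codes = np.zeros(c.shape, dtype=int)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = gray[1 + dy:gray.shape[0] - 1 + dy, 1 + dx:gray.shape[1] - 1 + dx]
        codes |= (neighbor >= c).astype(int) << bit
    hist = np.bincount(codes.ravel(), minlength=256).astype(float)
    return hist / hist.sum()

frame = np.random.default_rng(1).integers(0, 200, size=(36, 64))
h1 = lbp_histogram(frame)
h2 = lbp_histogram(frame + 50)   # global brightness shift
# the LBP histogram is identical under the global illumination change
```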

2. Compute the measurement of continuity.

Continuity measures the similarity between adjacent frames. At a shot boundary, the continuity should have a low value. Using this measurement, the content of a video can be transformed into a one-dimensional temporal signal. If the measurement is associated with only two adjacent frames, it is hard to detect a GT, since the variation between two adjacent frames is small. Thus, a larger time window is used, in which K frames lie along the time axis. See Figure 1 below for an example. Their all-pair similarities can be computed, and a graph can be constructed over these K frames with K*(K-1) edges weighted by similarity, as demonstrated below. We’ve adopted histogram intersection as the similarity measure, weighted by the temporal distance between the two frames in a pair.


Figure 1. The graph with K*(K-1) edges (only part of edges are shown) and the K*K weight matrix

The normalized cut CN of this graph is calculated as the continuity value of the middle frame in this window where


Since color and LBP histograms are both employed, two curves can be obtained. The results are combined by multiplying them together.
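As a simplified sketch (assuming the window splits evenly at the middle frame), the continuity value can be computed as the normalized cut between the two halves of the K-frame window:

```python
import numpy as np

def continuity(weights, half):
    """Normalized cut between the first and second half of the K-frame
    window (weights is the K*K similarity matrix); a low value suggests a
    shot boundary at the window center."""
    cut = weights[:half, half:].sum()
    return cut / weights[:half, :].sum() + cut / weights[half:, :].sum()

K = 6
uniform = np.ones((K, K))        # one continuous shot: all frames alike
split = np.full((K, K), 0.1)     # boundary at the center: halves dissimilar
split[:3, :3] = 1.0
split[3:, 3:] = 1.0

c_same = continuity(uniform, K // 2)   # high continuity, no boundary
c_split = continuity(split, K // 2)    # low continuity, boundary detected
```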

3. Decide the position (and type) of the shot boundary.

There are two approaches to determine the shot boundary. The first uses a pre-defined threshold to classify the curve into two categories. The second relies on machine-learning techniques to train a classifier. As we lack enough training data, we selected the first approach.

Face Tracking

What is face tracking?

Face tracking follows a human face through a video or continuous image sequence from a start point (with parameters such as position, scale, rotation, and expression) given by face detection or even face alignment techniques (Figure 2).

Face tracking may be implemented online or offline. In online mode, a face is tracked while the video is being captured. Thus, only current and previous frames can be used to exploit information for tracking and the efficiency requirement is strict. In offline mode, the whole video file is generated ahead of time. Therefore, the information of any frame can be used to guide the tracking.

In Face Match, since the video has been obtained beforehand, we implement tracking in offline mode, and only the position and scale of the face are concerned.


Figure 2. Illustration of face tracking

Why is face tracking needed?

Video is generally composed of tens of thousands of frames. To find as many faces as possible in each frame, one option is to perform face detection frame-by-frame. Given that it takes 0.3 seconds for a frame sized 640×360, processing a video is more than eight times slower than video playback. Thus, it is not feasible in practice.

Considering the continuity of video along the time axis and the redundancy between adjacent frames, face tracking can be employed instead of face detection in each frame. Since face tracking is very efficient, the time cost can be significantly reduced. Moreover, the faces of the same person in consecutive frames can be linked together. Thus, for each face track, only representative face samples are needed in subsequent face clustering or tagging steps, which can dramatically decrease processing time. Moreover, face tracking can help recover more difficult-to-detect faces.

How do you achieve face tracking?

There are several mature standard models for object tracking, such as optical flow, mean shift, and particle filters. Given the need to process thousands of videos efficiently, we adopted an optical-flow-based tracker, following the Kanade–Lucas–Tomasi (KLT) approach, which is based on object appearance and nonlinear least-squares optimization. If the appearance of the object changes only slightly over time, tracking performance is very good. The approach can also handle motion parameters beyond translation and scale, such as 3D rotation angles and expression parameters (as in active appearance models). By adopting inverse compositional techniques, the optical flow can be solved very efficiently.

An optical-flow-based tracker exploits the continuity of adjacent frames under three assumptions:

  1. Assume the appearance of the target object is similar or the same in adjacent frames
  2. Assume the target object should have abundant texture
  3. Assume the variation of pose parameters (translation, scaling, rotation) should be small

For face tracking in a video stream, these three assumptions are usually satisfied.

Given a face box in the first frame, optical flow minimizes the appearance difference between face areas in adjacent frames to find the best face box in the next frame. The parameters to describe a face box in our application are translation and scale. To solve a non-linear, least-square problem, the parameters can be obtained iteratively. Some further considerations are:

  • To alleviate sensitivity to illumination, we use the normalized intensity of gradients fm_equ3_3 as the appearance descriptor fm_equ3_4, since it is also simple to compute. The original intensity of gradients is normalized by a sigmoid function to limit its dynamic range to [0, 1].


  • To cover large displacement of face both in and out of the image plane, a multi-resolution strategy with pyramid structure is employed.
  • Two-step tracking strategy is proposed: 1) track only the translation of the face area using pyramid structure; 2) track translation and scale synchronously in single resolution.
  • To prevent the track from drifting into the background, an online learning model is adopted in the second step above. Each pixel in the appearance of the face area is modeled as a Gaussian distribution, with mean and variance updated during tracking. If the tracking error exceeds a pre-defined threshold, the tracking is terminated.
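The translation-only core of the tracker can be sketched as a Gauss-Newton optical-flow step using the inverse-compositional trick mentioned above. This is a minimal illustration on synthetic frames, not the production tracker:

```python
import numpy as np

def sample(img, ys, xs):
    """Bilinear sampling of img at float coordinates."""
    y0 = np.floor(ys).astype(int); x0 = np.floor(xs).astype(int)
    wy = ys - y0; wx = xs - x0
    return ((1 - wy) * (1 - wx) * img[y0, x0] + (1 - wy) * wx * img[y0, x0 + 1] +
            wy * (1 - wx) * img[y0 + 1, x0] + wy * wx * img[y0 + 1, x0 + 1])

def lk_translation(prev, curr, box, iters=30):
    """One KLT-style step, translation only: minimize the squared appearance
    difference between the template in `prev` and a shifted patch in `curr`
    by Gauss-Newton. Template gradients are computed once outside the loop
    (the inverse-compositional trick)."""
    y0, y1, x0, x1 = box
    T = prev[y0:y1, x0:x1].astype(float)
    gy, gx = np.gradient(T)                       # template gradients
    H = np.array([[(gx * gx).sum(), (gx * gy).sum()],
                  [(gx * gy).sum(), (gy * gy).sum()]])
    yy, xx = np.mgrid[y0:y1, x0:x1].astype(float)
    p = np.zeros(2)                               # (dx, dy)
    for _ in range(iters):
        err = sample(curr, yy + p[1], xx + p[0]) - T
        b = np.array([(gx * err).sum(), (gy * err).sum()])
        p -= np.linalg.solve(H, b)                # inverse-compositional update
    return p

# synthetic pair: the whole scene shifts by (dx, dy) = (2, 1) pixels
y, x = np.mgrid[0:60, 0:80].astype(float)
prev = np.sin(x / 7.0) + np.cos(y / 9.0)
curr = np.sin((x - 2.0) / 7.0) + np.cos((y - 1.0) / 9.0)
p = lk_translation(prev, curr, (20, 40, 25, 55))
# p converges to approximately (2.0, 1.0)
```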

Tracking is preceded by face detection and shot boundary detection: face detection provides starting points for face tracking, and shot boundary detection constrains each face track to lie within a single shot. Thus, before tracking, each shot contains several detected face boxes in different frames. We iteratively associate the detected faces into longer tracks and extend the connected tracks with further tracking. This completes the tracking step.

This is the 3rd blog of the Face Match tech blog series. You can browse the other 3 blogs in this series by visiting:

1. Face Match System – Overview

2. Face Match System – Face Detection

4. Face Match System – Clustering, Recognition and Summary

Face Match System – Face Detection

May 3rd, 2014 by Tao Xiong

Following the workflow of the Face Match system, this blog entry introduces the first core technique: face detection.

Face detection

How does the system identify which faces to detect?

Face detection is an essential step in face tagging. The detection rate strongly correlates with the final system recall of faces. We take careful steps to detect frontal faces as well as profile faces because the latter are indispensable for recalling whole-profile face tracks, which are abundant in premium videos. The detection of rotated faces is also a necessity. See Figure 1 below for an illustration of face poses we strive to detect with our detector, where yaw refers to different profile degrees ranging from -90 to 90 degrees, and rotation refers to in-plane rotation from -90 to 90 degrees. We do not run the full range of in-plane rotation due to efficiency considerations.


Figure 1. Out-plane and in-plane rotations of human face

Incorporating such variances in the detector complicates the architecture design. We need to carefully design the algorithm and parameters to achieve a balance among accuracy, false detection rate, and running speed. Keep in mind that the detector is the most time-consuming component of the whole system.

Building a multi-view face detector

Face detection is a well-studied problem with a long research tradition. The state-of-the-art detector follows the sliding window approach to exhaustively scan all possible sub-windows in one image, and it relies on cascade-boosting architecture to quickly filter out negative examples. See Figure 2 for an illustration of the cascaded classifiers. Each stage (denoted as 1, 2, 3, etc.) is a classifier, which scores the sub-windows. The windows with scores below a certain threshold are discarded and only those with larger scores are passed. Thus with carefully designed classifiers, we can safely filter a certain portion of negative examples without falsely rejecting many truly positive examples. Though the number of sub-windows is huge for an image, most of the windows are negative examples and will run through only one or two stages. Thus the process is quite efficient for a single-face pose.


Figure 2. Cascade classifiers
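The early-rejection control flow of the cascade can be sketched as follows; the stage classifiers and thresholds below are toy stand-ins, not trained boosted stages:

```python
def cascade(window, stages):
    """stages: list of (score_fn, threshold) pairs. A window must pass every
    stage to be accepted; most negatives are rejected after one or two
    cheap stages, which is what makes exhaustive scanning affordable."""
    for score, threshold in stages:
        if score(window) < threshold:
            return False          # early rejection: later stages never run
    return True

# toy stand-in stages: the "classifier" is just the window's mean intensity,
# with progressively stricter thresholds
stages = [(lambda w: sum(w) / len(w), t) for t in (0.2, 0.4, 0.6)]

accepted = cascade([0.9, 0.8, 0.7], stages)   # passes all three stages
rejected = cascade([0.1, 0.1, 0.1], stages)   # rejected at the first stage
```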

However, running separate detectors for multiple poses in parallel ignores the structure of the face-pose space and is inefficient. To facilitate feature- and detector-sharing among different poses, various hierarchical detector structures have been proposed and implemented. We chose the pyramid structure for its simple and independent training process for the underlying component detectors. The pyramid structure is a coarse-to-fine partition of multi-view faces. See Figure 3 below for an illustration of the yaw-based partition process.


 Figure 3. Partition process of yaw angle

Our situation is a bit more complex since we need to deal with in-plane rotation and yaw rotation at the same time. Thus a branching node is needed to decide whether a given example will go to the in-plane rotation branch or the yaw rotation branch (Figure 4). More specifically, we train a five-stage all-pose face/non-face detector as the root node. Then we train two detectors for in-plane rotation and yaw rotation respectively, each with ten stages. The outputs of these two detectors are compared to select a subsequent branch hereafter. After that, the problem is converted to the solved problem of rotation in one dimension, be it in-plane rotation or yaw rotation. In a given branch, the same coarse-to-fine strategy is used. The final output incorporates both face position and pose estimation.


Figure 4. Whole face detector structure to handle multi-view faces

Haar wavelet features are typically used in face detection because they are simple and fast to compute. However, the feature dimension often ranges in the tens of thousands. In contrast, the local binary pattern (LBP) feature is only a 58-bin sparse histogram. It captures the local geometric structure of the image and is less sensitive to global illumination variations. Thus, we’ve adopted the LBP histogram for our system.

We’ve also integrated the boosting framework for training the classifier stages. We use a RankBoost like reweighting scheme in each round to balance the weights for positive and negative examples. This is useful to tune the classifiers to focus more on the limited positive examples. We also follow the nested cascade structure to further reduce the number of weak classifiers needed in the detector.

Synthetic examples (flipped and rotated versions of faces, with small random position, scale, and rotation transformations) are created to enlarge the face dataset. Multithreading makes the training process considerably faster.

Our multi-view face detector can detect faces in about 300ms for 640×360 images. The accuracy is about 80 percent for frontal faces and 60 percent for profile faces, both at 5 percent false detection rate.

This is the 2nd blog of the Face Match tech blog series. You can browse the other 3 blogs in this series by visiting:

1. Face Match System – Overview

3. Face Match System – Shot Boundary Detection and Face Tracking

4. Face Match System – Clustering, Recognition and Summary


Face Match System – Overview

May 3rd, 2014 by Zhibing Wang


We must confess: sometimes even we have a hard time recognizing actors in TV shows and movies. Sometimes the name is right on the tip of our tongues, but we still don’t know it. It’s even more difficult with some foreign actors. But, if there was a way for a video to provide detailed metadata about an actor whenever he or she pops up in a video, Hulu users would have the benefit of having that information displayed from the Hulu player, with the option to learn more about the actor they’re interested in whenever they wanted.

From another point of view, general multimedia content analysis remains an unsolved problem — even with significant progress made in the past 20 years. However, unlike general content analysis, face-related technologies like face detection, tracking and recognition have recently matured into consumer products. The combination of these types of advances in technology with our relentless pursuit to enhance the user experience at Hulu, is where the idea of “Face Match” originated.

System design

When first examining the problem, one solution would be to examine all frames of the video and use human effort to exhaustively annotate all the faces that appear in them. However, this method would not scale to the billions of videos on the Internet. At the other extreme, an algorithm could automatically detect and identify the faces in every frame. However, the bottleneck of this approach is that current recognition algorithms can only achieve approximately 80% accuracy at best — which is far below the minimal user expectation. Taking both of these methods into account, it became apparent that the best solution would combine the merits of each while minimizing the human effort required.

Our system was designed to carefully balance the computational complexity while also minimizing human effort. As shown in Figure 1, the Face Match platform contains two main parts: the initial segment and the auto-tag-cluster-confirm cycle. For each video, faces are detected and grouped into face tracks/groups. The details of these technologies are described in the next paragraphs.

To minimize the amount of human effort required to label each individual face, visually similar face tracks are grouped via clustering. Thus, a human can select a number of face tracks at a given time and label all of them in one fell swoop.

For each show, the system first collects celebrity information from the web. Then, for initial videos in each show, 20 percent of face tracks are clustered and left for manual labeling. These bootstrapped celebrity labels are helpful in supervised track tagging. Though all face tracks can be clustered and simply left for manual labeling, this leads to a heavy workload. To improve the efficiency of human annotation, we’ve introduced an auto-tag-cluster-confirm cycle. With the bootstrap labels, the system can learn predictive models for celebrities. The models predict unlabeled tracks that are left for human confirmation. As the pool of celebrity labels grows with each iterative cycle, the system is able to learn face models with better precision. In the front end, displaying a large number of a celebrity’s face tracks for manual confirmation would be inefficient since a human still needs seconds to verify each face track. Similar to the initial annotation process, the system also clusters visually similar face tracks together. Thus, humans can confirm a number of tracks in one simple click, with one quick glance.

Figure 1. Overview of the system design. A.) Face groups/tracks are detected and extracted for each video; B.) For each show, celebrity information is collected and the initial pool of face tracks (20 percent) are clustered for bootstrap labels by user annotation; C.) Automatic face track tagging is introduced in the auto-tag-cluster-confirm cycle to minimize the human effort.

To detect and connect faces and place them into tracks, we leverage face detection and tracking algorithms. We’ve also trained a multi-view face detector for 180-degree plane rotation and 180-degree yaw changes with about ten thousand labeled examples. The face detection algorithm needs roughly 300 milliseconds to process a 640×360 frame (running on PC with Xeon(R) CPU E5420 at 2.50GHz). Thus, detecting all video frames would consume nine times the amount of real-time processing — which is unacceptably slow. Our tracking system rescues us from such heavy computation and associates isolated faces into continuous tracks. It can also extend face tracks to frames well beyond the initial detection result, which increases the whole system recall at moments when the detector fails to detect existing faces. This also effectively reduces the number of candidates for later face tagging by a factor of 100. As a result, we only need to tag face tracks and not isolated faces. To avoid the “drifting away” phenomenon in tracking, shot boundaries are detected and incorporated as well.

In automatic track tagging, we also take advantage of the multi-sample (multiple faces per face track), multi-view features (clothes, hair, facial and contextual features). As in Figure 2, where the pipeline of automatic track tagging is shown, the system first builds databases with annotated tracks. Then for each face track, the system extracts multi-view features for all samples in the track. And, for each face and each feature, the system finds its nearest neighbors via ANN and decomposes it as a linear combination of its neighbors. Finally, the identity coefficient distributions for all faces and all features are aggregated for the final tagging of this track. For the details of the algorithm, please refer to the Track Tagging section.

Figure 2. Algorithm pipeline for the track tagging method

Processing pipeline

As shown in Figure 1a), when a video is about to be released, the system first determines the shot boundaries and densely samples frames, applying the multi-view face detector to each sampled frame. The detected faces provide a starting point for face tracking. Then, tracking algorithms associate the isolated single faces into connected face groups of the same person. With the extracted tracks, clustering algorithms group similar tracks for user annotation. Or, a track tagging stage can also automatically tag actor candidates on each track for user confirmation. Finally, the face tracks are left for human annotation or confirmation.

Combining these steps, we can automatically tag the tracks of one video in real time. For a typical TV show, 80 percent of the face tracks can be successfully picked out for processing, with 5 percent false positives. After that, human intervention is still required to verify the final results.

In the next three blogs, four core techniques — face detection, face tracking with shot boundary detection, face track clustering, and face track recognition — will be introduced. The annotation step is omitted since this blog covers only technical algorithms.


BeaconSpec: A domain-specific language for generating MapReduce jobs

April 10th, 2014 by Prasan Samtani

Perhaps of all the creations of man, language is the most astonishing

- Giles Lytton Strachey, Words and Poetry


Hulu viewers generate a tremendous amount of data: our users watch over 400 million videos and 2 billion advertisements a month. Processing and analyzing that data is critical to our business, whether it is for deciding what content to invest in in the future, or for convincing external partners of the superiority of Hulu as an advertising platform. In order to ingest, process and understand that data, we rely on a framework called Hadoop (http://hadoop.apache.org/) – an open-source project that allows for the storage and distributed processing of large data-sets.

The two major components of Hadoop are a distributed file system (HDFS) and MapReduce – a framework for distributed computing over the cluster. In order to understand the rest of this blogpost, it is important to have a basic understanding of how MapReduce works.

When authoring a MapReduce job, programmers specify two functions – a map function that processes each chunk of data to produce a set of key-value pairs, and a reduce function that merges all values associated with the same key (1). A surprising number of conventional algorithms can be expressed in this manner and thereby converted into programs that can run in parallel. This is best understood with an example.

Let us imagine that we are trying to compute the total number of minutes watched for a given set of videos. Each video has a unique identifier, and what we would like at the end of this computation to have a table in the following form:

Video identifier    Minutes watched
1                   25
2                   34
3                   126
4                   5

In a distributed file system such as HDFS, the raw data will be stored on several separate computers, or nodes. A naive way to compute this would be for the program to serially request the raw data from each of the nodes and perform the computation. Such an approach works fine for small data-sets like this example; however, we want an approach that scales to very large data-sets. For example, in a typical hour we have about 50 GB of raw data on our Hulu cluster. Clearly, sequentially processing all this data would take a very long time and consume a lot of network resources. Using MapReduce, the mapper process runs on each node in parallel, generating a key-value pair consisting of the video identifier (the key) and the minutes watched (the value). Illustration 1 shows what the output of the mapping phase would look like if the data were distributed across three nodes.




From here, we go into the reduce phase, where all outputs with the same key are guaranteed to be sent to the same reducer. The reducer then computes some function over the intermediate output to produce a final output. In this case the function is summation, as all we want is the total number of minutes per video. Illustration 2 shows what this would look like if this processing ran on two reducers.
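The map, shuffle, and reduce phases above can be simulated in a single process. This is a toy sketch of the framework's contract (not Hadoop code) applied to the minutes-watched example:

```python
from collections import defaultdict

def map_phase(records):
    """Mapper: emit a (video_id, minutes) pair for each raw viewing record."""
    for video_id, minutes in records:
        yield video_id, minutes

def shuffle(pairs):
    """Group all values for the same key, as the framework guarantees."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: the function here is summation, total minutes per video."""
    return {video_id: sum(values) for video_id, values in groups.items()}

# records as they might sit on three separate HDFS nodes
node1 = [(1, 10), (2, 34)]
node2 = [(1, 15), (3, 126)]
node3 = [(4, 5)]
pairs = [kv for node in (node1, node2, node3) for kv in map_phase(node)]
totals = reduce_phase(shuffle(pairs))
# totals == {1: 25, 2: 34, 3: 126, 4: 5}, matching the table above
```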




Event data from our various Hulu players is encoded in the form of beacons. A beacon is just a URL-encoded description of the event, as shown in the example below:

80    2013-04-01    00:00:00    /v3/playback/start?bitrate=650&cdn=Akamai&channel=Anime&client=Explorer&computerguid=EA8FA1000232B8F6986C3E0BE55E9333&contentid=5003673

The Hulu players on all our devices are sending a constant stream of beacons to our servers, those beacons are subsequently stored onto HDFS where they can be processed by our MapReduce jobs.


So far, we’ve seen how we can transform a conventional single-threaded computation into a MapReduce computation by specifying a mapper and reducer function. We’ve also seen how we encode events as beacons, and collect them onto our distributed file system, HDFS. However, in our experience, writing MapReduce jobs by hand is both tedious and error prone, although like most skills, you get better at it with practice. Additionally, the resultant code contains a significant amount of boilerplate, making the logic hard to see at first glance. The latter is in practice the most significant impediment to overall system understandability and debuggability. Since we run many (on the order of 150-175) different types of MapReduce jobs every hour, we wanted a solution that would allow us to encode the logic in a straightforward way that was easier to maintain than hand-written Java code.

We started by looking at our code and realizing that the majority of our MapReduce jobs perform very similar functions – selecting a set of dimensions that we care about (for example the video identifier, the zipcode, etc), performing some lookups against meta-data tables in our key-value store (for example deriving the zip code dimension from the IP address of the request) and aggregating over a corresponding fact (for example, the total minutes watched). We realized that we could embed the basic knowledge of how to do these things in a language, and then simply use programs written in this language to generate our MapReduce code for us. Since internally we refer to the events we receive from our players as beacons, and the purpose of this language was to process raw beacon data, we called the language BeaconSpec.

An example of BeaconSpec code is shown below:

basefact playback_start from playback/start {
    dimension harpyhour.id as hourid;
    required dimension video.id as video_id;
    required dimension contentPartner.id as content_partner_id;
    required dimension distributionPartner.id as distribution_partner_id;
    required dimension distributionPlatform.id as distro_platform_id;
    dimension distributionPlatform.isonhulu as is_on_hulu;
    dimension package.id as package_id;
    dimension package.isplusbypackage as is_plus_by_package;
    dimension plan.id as plan_id;
    dimension plan.isplusbyplan as is_plus_by_plan;
    dimension plusCategory.pluslowercategoryid as plus_lower_category_id;
    dimension plusCategory.plusuppercategoryid as plus_higher_category_id;
    dimension client.out as client;
    fact sum(count.count) as total_count;
    dimension packageAvailability.chosen as package_availability;
    dimension siteSessionId.chosen as site_session_id;
    dimension facebook.isfacebookconnected as is_facebook_connected;
}

There are a few key points to note:

  • There is no specification of how to compute the final result, only a declaration of what we would like to compute. It is the role of the BeaconSpec compiler to take this declarative specification and convert it into imperative MapReduce code that can run on the cluster.
  • The keyword basefact denotes a metric that we want to measure, and from tells us which source beacons we need to process in order to compute the metric.
  • The keyword dimension denotes a particular dimension of the source beacon that we care about and that should form part of the intermediate key out of the Mapper phase.
  • The keyword fact denotes a dimension of the source beacon that we want to aggregate over. In this particular example we perform a summation, as the fact keyword is immediately followed by the sum specifier; we could just as easily calculate an average (avg) or take the maximum value (max).
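Conceptually, the generated job keys each beacon by the declared dimensions and applies the declared aggregate to the declared fact. A minimal Python sketch of that behavior (the field names and spec structure are illustrative, not the real generated Java):

```python
from collections import defaultdict

# hypothetical, pared-down spec: two dimensions, one summed fact
SPEC = {"dimensions": ["video_id", "hourid"], "fact": "count", "agg": sum}

def mapper(beacon):
    """Key the beacon by its declared dimensions; the fact is the value."""
    key = tuple(beacon[d] for d in SPEC["dimensions"])
    return key, beacon[SPEC["fact"]]

def reducer(pairs):
    """Apply the declared aggregate to all values sharing a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: SPEC["agg"](values) for key, values in groups.items()}

beacons = [
    {"video_id": 5003673, "hourid": 0, "count": 1},
    {"video_id": 5003673, "hourid": 0, "count": 1},
    {"video_id": 42, "hourid": 0, "count": 1},
]
result = reducer(mapper(b) for b in beacons)
# result: {(5003673, 0): 2, (42, 0): 1}
```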

From this specification, our in-house compiler produces runnable Java MapReduce code, a snippet of which is shown below:

(Screenshot: generated Java MapReduce code)

We use a variety of open-source technologies in order to build our compiler – in particular JFlex for lexical analysis & CUP for parser-generation. These are the Java cousins of the old C programs you probably used if you’ve ever taken a compilers class – lex and yacc.

A major advantage of a formal declarative specification of our process is that it allows us to extend functionality far beyond what we initially planned. For example, we are currently in the process of building out a program whose purpose is to validate beacons sent by implementations of our Hulu player on a range of devices. For this purpose, we can use BeaconSpec as an input to the validation program, which will subsequently examine incoming beacons and compare them to the specification, and send us reports about whether the incoming beacons match or deviate from the specification. As another example, as we move towards real-time processing of our incoming data, we are examining the possibility of creating a secondary code-generator for the BeaconSpec compiler that will output code to run on Apache Storm instead of MapReduce.

Related work

Apache Pig is a project that has similar goals to BeaconSpec. Pig programmers write their scripts in a language called Pig Latin, which is subsequently compiled into MapReduce code. However, unlike BeaconSpec, Pig is both an imperative and a general-purpose language. We feel that for this particular use case, the advantages conferred by a declarative domain-specific language are too great to abandon for a general-purpose language. An imperative general-purpose language cannot avoid introducing boilerplate and insignificant details, which make the final program significantly less clear than what we can achieve with a declarative domain-specific language.

Summingbird is a project at Twitter which can generate code targeting MapReduce, Scalding or Storm. It offers a powerful set of abstractions for performing aggregations. An example of the canonical word-count program in Summingbird is shown below:

def wordCount(source: Iterable[String], store: MutableMap[String, Long]) =
   source.flatMap { sentence =>
     toWords(sentence).map(_ -> 1L)
   }.foreach { case (k, v) => store.update(k, store.get(k) + v) }

For an example of the equivalent code written directly in Java MapReduce, see this link. Summingbird was written to solve problems similar to the ones that led us to create BeaconSpec, and we believe it does so in a way that is significantly more expressive than Pig. However, it is written in a highly idiomatic style, and learning to write Summingbird programs involves a steeper learning curve than learning BeaconSpec.

A few other languages have emerged around the Hadoop ecosystem, such as Kiji (which offers a table abstraction and several insertion/retrieval operators over HBase), and Hive (HiveQL) – which offers a subset of relational database operators that are compiled to MapReduce. We have not fully explored Kiji, however, we make heavy use of Hive at Hulu.

Sawzall (developed at Google) can arguably be acknowledged as the progenitor of all of these languages; it is another data processing language designed to compile to MapReduce code on Google’s proprietary distributed data processing platform. A link to a paper on it can be found here.


The key takeaway is that we don’t want a general purpose language, we want a language that expresses exactly what we care about, and suppresses details that are not central to the task (2). Whenever you are working on a DSL, adding general purpose features to the language is a serious temptation, but one that must be avoided if you don’t want your project timeline to rival that of Duke Nukem Forever. This sentiment is best captured by the pioneer computer scientist Alan Perlis in the following quote:

Beware of the Turing tar-pit, in which everything is possible, but nothing of interest is ever easy.

-Alan Perlis, Epigrams in Programming



  1. MapReduce: Simplified data processing on large clusters (Jeffrey Dean and Sanjay Ghemawat, Google Inc., 2004) http://static.googleusercontent.com/media/research.google.com/en/us/archive/mapreduce-osdi04.pdf
  2. Structure and interpretation of computer programs (Hal Abelson and Gerald Jay Sussman, MIT, 1984) http://mitpress.mit.edu/sicp/
  3. Epigrams in Programming (Alan Perlis) http://www.cs.yale.edu/homes/perlis-alan/quotes.html
  4. Interpreting the Data: Parallel Analysis with Sawzall (Pike et al, Google Inc.) http://research.google.com/archive/sawzall.html

Categorizing Customer Support Contacts with Machine Learning

December 17th, 2013 by Chris Liu


One of the cool things about being an intern at Hulu is that we are given the freedom to creatively apply our computer science knowledge to everyday problems. Our customer support process generated such a problem: we’ve got a stream of incoming customer support emails, and would like to automatically assign textual tags to them based on their content. This is a machine learning problem of text classification, and one that we tackle below.

A binary text classification problem is to learn a rule that classifies a concept given a training set. The training set consists of pairs $(x_i, y_i)$, with $x_i$ being a text from some instance space $X$ and $y_i \in \{0, 1\}$ being a binary label. We assume there exists a true decision function $f: X \rightarrow \{0, 1\}$ where $f(x_i) = y_i$. Our goal is to find a hypothesis $h: X \rightarrow \{0, 1\}$ that best approximates $f$.

For example, if we take our instance space to be the set of all sentences in the English language, and $y_i = 1$ to mean the sentence has positive sentiment, then our problem is to find a classifier that classifies the sentiment of English sentences. This example already illustrates problems we will face later: how do we reconcile the fact that the set of all sentences is potentially infinite? How do we represent our sentences in a way that best suits learning? And, most importantly, how do we best approximate the true decision function $f$ from just our training examples?

Hulu gets thousands of customer support emails daily. Each email is represented by a ticket and associated with a set of concise descriptors called tags. With a well-performing tag classifier, the benefits are tangible — we can save our Customer Support Advocates (CSAs) the need to manually tag tickets, and we can route tickets to specific groups of CSAs who specialize in a particular subject area. In general, we would be able to streamline the support experience.

Our problem comes with a unique set of challenges. We would like our service to run on Donki, Hulu’s internal service hosting platform, and to limit memory usage to a somewhat conservative 512 MB. The size of our training set may be very large, and is potentially unbounded. With these challenges in mind, we will describe the approach taken, what worked, what didn’t, and the end result.


Given our wish for the service to run on Donki, Python seems like the natural choice. The NumPy/SciPy/Scikit-Learn stack is an accessible, open source, and comprehensive set of libraries that suffices for our purposes.

As with any machine learning problem, feature extraction, feature selection, and model selection are crucial. Feature extraction decides which tickets are suitable for training and extracts the useful parts of suitable tickets; this is the process of canonicalizing a ticket. Feature selection turns the canonical tickets extracted above into a training set. Model selection turns the training set into a classifier.

Feature Extraction

We would like our canonicalized ticket to be faithful to the eventual prediction context. This means, for example, that the CSA’s response to the ticket should not be included. Fortunately, all changes to a ticket are recorded in chronological order, so we are able to extract only the text and tags relating to the customer’s first contact.

Feature extraction also relies upon heuristics and domain-specific knowledge of the problem. Some additional ticket fields are also of value: we take the operating system field, the subject field, and whether the ticket was created by a Hulu Plus subscriber. For example, tickets from Hulu Plus subscribers are more likely to be about billing. To incorporate these additional fields, we append them to the existing text; for example, if a user is a Hulu Plus subscriber, we append the word “IsPlusSub”.
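A minimal sketch of this canonicalization step, assuming hypothetical function and field names (the actual ticket schema isn’t shown in this post):

```python
def canonicalize(text, operating_system, subject, is_plus_sub):
    """Fold useful ticket fields into the text itself so the
    downstream bag-of-words model can pick them up as tokens."""
    parts = [text, subject, operating_system]
    if is_plus_sub:
        # Plus subscribers skew towards billing issues, so mark
        # them with a dedicated token the model can learn from.
        parts.append("IsPlusSub")
    return " ".join(p for p in parts if p)
```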

Sometimes, manually viewing the tickets and the extracted result is the only way of gaining insight. One particular issue was that users replying to emails would bring in a large number of highly suggestive words. For example, an automated email from Hulu contained a link allowing users to “unsubscribe” from these emails, which was highly suggestive for certain tags. A heuristic was used to filter out the reply portions of the text. Another amusing example was that certain Hulu emails contained the words “Vizio”, “Samsung”, “Sony”, and other TV platforms on which Hulu Plus is available. A pre-existing rule would activate on these keywords, giving these tickets a large number of irrelevant tags.

Feature Selection

As we hinted before, the instance space $X$ from which we draw tickets may be infinite. One common approach to this problem is the “bag of words” representation: we simplify our instance space from the set of all ticket text to the set of all “suitable” words, which is always finite. The “suitable” criterion is another heuristic, but intuitively, we want to exclude common but meaningless words. For example, the word “I” or “we”, or, in our case, “Hulu”, is meaningless with regard to the tags of a message. Excluding these words helps our model find meaningful relationships.
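A bag-of-words sketch using only the standard library; the stop-word list here is illustrative, not the one used in production:

```python
import re
from collections import Counter

# Illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"i", "we", "the", "a", "to", "hulu"}

def bag_of_words(text):
    """Lowercase, tokenize, and drop common-but-meaningless words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)
```

For instance, bag_of_words("I cannot cancel my Hulu subscription") keeps “cancel” and “subscription” but drops “I” and “Hulu”.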

Our revised instance space is often called the vocabulary of our examples. Another optimization we can make is to expand our vocabulary using “n-grams”: contiguous sequences of n words. n-grams can capture relationships between words. For example, a ticket with the words “not cancel” and another with “cancel not” would be equivalent in a vocabulary consisting solely of single words. A 2-gram, however, captures the difference: the first ticket’s vocabulary contains (“not”, “cancel”, “not cancel”), while the second’s contains (“cancel”, “not”, “cancel not”).
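A sketch of the n-gram expansion for a single tokenized document:

```python
def ngrams(tokens, n):
    """All contiguous length-n word sequences, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def vocabulary(tokens, n_range=(1, 2)):
    """Unigrams plus bigrams by default, so word order is not lost."""
    vocab = []
    for n in range(n_range[0], n_range[1] + 1):
        vocab.extend(ngrams(tokens, n))
    return vocab
```

Here vocabulary(["not", "cancel"]) yields ["not", "cancel", "not cancel"], which differs from vocabulary(["cancel", "not"]).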

In order to use the bag of words representation, we need words. This involves pre-processing and tokenizing the canonicalized tickets. We settled on the n-gram range of $(1, 2)$ through experimental testing. This process results in very high dimensional data, since each word and each 2-gram is a dimension. We attempted dimensionality reduction techniques to combat the so-called “curse of dimensionality”. Techniques such as Principal Component Analysis and Latent Semantic Analysis attempt to find the linear combinations of features that best preserve the information in the data. Exploration on this front produced some graphics.



The semantic meanings of an “account_billing” tag and an “account_login_activation” tag are quite similar; both have to do with a user’s account problems. Hence, when we project the examples down to their two most “information preserving” dimensions, we do not see clear separation. However, “account_billing” and “technology_playback” have quite distinct meanings, and we see clear separation after dimensionality reduction. Just to confirm our intuition, we plot “android” vs “ios”, two tags with distinct meanings, and get the following:


It’s always nice when the mathematics is in line with our intuition. While these techniques were not ultimately used due to scalability problems, the ability to visualize and confirm our suspicions regarding the separability of tags was highly valuable.

For each document, we want to create a corresponding bag of words representation. The simplest approach is to scan over the set of all documents (also called the corpus) and build a dictionary consisting of the entire vocabulary. If the length of this dictionary is $n$, then a vector $v$ of length $n$ models a document, where the $i$th index represents an element of the vocabulary and the value at the $i$th index is its frequency in the document. One common technique on top of the frequency count is “inverse document frequency” re-weighting, which intuitively says that frequent words in the corpus are inherently less meaningful.
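A toy version of that weighting scheme, assuming the plain log(N/df) form of inverse document frequency (real implementations usually add smoothing):

```python
import math
from collections import Counter

def tf_idf(corpus):
    """corpus: a list of token lists, one per document.
    Returns one {term: weight} dict per document, weighting raw
    term frequency by inverse document frequency."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))  # count each term once per document
    vectors = []
    for doc in corpus:
        tf = Counter(doc)
        vectors.append({term: count * math.log(n_docs / doc_freq[term])
                        for term, count in tf.items()})
    return vectors
```

Note that a term appearing in every document gets weight zero, capturing the intuition that corpus-wide words carry little meaning.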

The above described technique is often referred to as term frequency–inverse document frequency (tf-idf) vectorizing. The technique requires the entire vocabulary in memory, rendering it stateful, so we explored alternatives for scaling purposes. We settled on what is called the “hashing trick”. Very simply, instead of building a vocabulary by scanning the corpus once ahead of time, we explicitly pre-define the output space to be a high-dimensional vector space. Then, for each word and n-gram, we hash it and increment that index of the output vector. Because collisions are unlikely in a high-dimensional space, the hashed vector is a faithful representation of the original text. What we gain is statelessness, so we can apply this technique to batches of documents. A brief survey of papers on this subject (here, and here) shows the hashing trick has both theoretical and empirical support.
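A minimal version of the hashing trick; the dimensionality is illustrative, and a stable hash is used so the token-to-index mapping is reproducible across processes (Python’s built-in hash is randomized per process):

```python
import hashlib

def hashed_vector(tokens, dim=2 ** 12):
    """Vectorize a tokenized document with no vocabulary state:
    each token is hashed straight to an index of the output vector."""
    vec = [0] * dim  # in practice this would be a sparse vector
    for tok in tokens:
        idx = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16) % dim
        vec[idx] += 1
    return vec
```

Because no corpus-wide dictionary is required, batches of documents can be vectorized independently.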

Model selection

In order to select a model, we need a way to calculate performance and a dataset to calculate performance on. For a binary classification problem, performance is defined by counts of true positives, true negatives, false positives, and false negatives. How we view these numbers is interesting and worth discussing. In our case, two relevant metrics are “precision” and “recall”. Precision measures the ratio $\frac{\text{true positive}}{\text{true positive} + \text{false positive}}$, whereas recall measures the ratio $\frac{\text{true positive}}{\text{true positive} + \text{false negative}}$. Achieving high precision means that when we do predict the positive label, we are mostly correct. High recall means we correctly predict positive on most of the positively labeled examples. High precision usually comes at the cost of low recall, and vice versa. For example, if a model does not predict positive on any example except the very few it is very sure of, it achieves high precision but low recall. If a model predicts positive on every example, it achieves high recall but low precision.

Because our tool will be used as a pre-processing step in the handling of a ticket, we decided that high precision is more valuable than high recall. We would rather not have the classifier make a positive prediction, unless it is reasonably sure that this prediction is correct.

To get a dataset for testing, we employ cross validation and use a test set. In cross validation, we split the training set into $n$ folds, and for $1 \leq i \leq n$, we train on all folds other than the $i$th fold and score on the $i$th fold. We make a test set by holding back a portion of the training set from training and calculating performance on that set after the model is trained. Both methods ensure the set we score on does not participate in the training of the model.
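The fold construction can be sketched as:

```python
def k_folds(data, n):
    """Yield (train, validation) pairs where the i-th fold is held
    out for scoring and the remaining n-1 folds are trained on."""
    folds = [data[i::n] for i in range(n)]
    for i in range(n):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]
```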

The overall metric we would like to optimize is the $F_1$ score, the harmonic mean of precision and recall. It can be calculated as $F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$. We surveyed a broad variety of classification algorithms, including Naive Bayes, K-Nearest Neighbors, Support Vector Machines, Stochastic Gradient Descent, Decision Trees, Random Forests, and AdaBoost. Unfortunately, the Decision Tree, Random Forest, and AdaBoost classifiers cannot handle sparse data, so they were immediately disqualified.
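As a quick check on these definitions, the three metrics computed from raw prediction counts:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean (F1) from counts of
    true positives, false positives, and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```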

The scores, training times, and prediction times of the different classifiers are listed below for a collection of 15,000 training and 6,500 test documents.

Classifier F1 Precision Recall Training time Prediction time
Linear SVM 0.590 0.703 0.509 68.5 s 0.43s
Stochastic Gradient Descent 0.587 0.689 0.512 33.2s 0.37s
Logistic Regression 0.556 0.718 0.454 31.8s 0.43s
Bernoulli Naïve Bayes 0.438 0.353 0.577 2.20s 1.76s
Multinomial Naïve Bayes 0.450 0.361 0.597 33.0s 0.40s
K-nearest neighbors 0.477 0.429 0.538 n/a 446.23s

In the end, the Stochastic Gradient Descent classifier was chosen. It has a desirable combination of speed and performance, and can be tweaked to optimize a variety of objective functions. Furthermore, it is an online algorithm, meaning the training process can be split into batches, alleviating memory concerns.

Odds and Ends

There are many tags that are simply unpredictable. Text classification relies upon the label having a clear and distinguishable meaning. For example, we found that tags indicating an action, such as giving a customer a credit, performed very poorly, since it is hard to gauge from the contents of the text whether a credit will be given. As such, we’ve implemented an automated pruning system in our classifier: we score each tag’s classifier on the test set at the end of training, and if that tag’s performance does not satisfy some minimum, we never predict true for that tag.
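The pruning step amounts to a filter over per-tag held-out scores; the minimum score here is illustrative:

```python
def prune_tags(tag_scores, min_f1=0.3):
    """tag_scores: {tag: F1 on the held-out test set}.
    Tags below the minimum are dropped and never predicted true."""
    return {tag for tag, f1 in tag_scores.items() if f1 >= min_f1}
```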

Secondly, we may want to partition tags into classes, where at most one tag can be predicted per class. For example, we may wish to avoid the many-device-tags problem described above and enforce a maximum of one device tag per ticket. To solve this problem, we appeal to geometry. Any linear classifier is a hyperplane in the feature space and can be represented as a weight vector $w$. A prediction for an incoming vector $x$ is made via $\mathrm{sign}(w \cdot x)$, since a positive value means the example lies on the positive-label side of the hyperplane. However, if we have a tag class $C$ and a set of classifiers $\{w_i : i \in C\}$, then we can output the prediction $\arg\max_i (w_i \cdot x)$, provided $\max_i (w_i \cdot x) > 0$. This is commonly referred to as the one-vs-all method of extending binary classifiers to predict on a tag class $C$.
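A sketch of one-vs-all prediction with the positivity guard described above, using plain Python lists as the weight and feature vectors:

```python
def one_vs_all_predict(class_weights, x):
    """class_weights: {tag: weight vector w}. Return the tag with the
    largest positive score w·x, or None if no score is positive."""
    def dot(w, v):
        return sum(wi * vi for wi, vi in zip(w, v))
    best_tag, best_score = None, 0.0
    for tag, w in class_weights.items():
        score = dot(w, x)
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag
```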


Applying machine learning techniques to real world data is tricky, exciting, and rewarding. Here, we’ve outlined some of the decisions we’ve made, approaches that didn’t work, ones that did, and how the steps of feature extraction, feature selection, and model selection were performed. In the future, we may expand into classifying phone transcriptions, and personalize the classifier for individual CSAs, to further optimize the automation within our customer support platform.

Chris Liu is a software developer intern on our customer platform team.

Hulu iPad redesign: lessons and code

December 11th, 2013 by Bao Lei

Earlier this year we launched a brand new revision of our iPad application. With lots of new features like our show and video discovery panel and mini player, we had to think very hard about the speed and stability of both the software we wrote and our development process. With a few months behind us now, we want to share some of the interesting technical tidbits from our product.

Building the UI

At Hulu, we choose not to use Interface Builder for our apps. Traditionally, we’d code the positions and resizing logic for each view. While this is frequently a lot of labor, we felt it left us with better control over the results than using a mouse to drag buttons. This approach keeps your views under control when the view is simple, or when you’re adding a few small features. With a big redesign, which included many different new pages and components, aligning everything manually with code becomes a lot less fun, not to mention challenging, as the design is often tweaked after observing actual on-device user interaction.

There has to be a better approach, and the most interesting one for us was Auto Layout, which Apple introduced along with iOS 6. While awesome, we ran into two problems: (1) it works only on iOS 6 and above, while we still support iOS 5; (2) it’s designed primarily for Interface Builder users, so it is a little harder to use programmatically.

Then we looked at one small, interesting part of Auto Layout: the ASCII-art-like Visual Format Language (VFL). That eventually became our solution: we would use VFL, but not Auto Layout. Our solution is now released as an open source project called vfl2objc.

How does it work? We built a system to generate Objective-C code based on a VFL-defined layout. This allows us to retain iOS 5 compatibility while keeping the layout instructions easy to read and modify. The approach is performant because most of the computations are done at compile time (except the magic behind the autoresizing mask, but that is relatively simple). It is also very flexible: you can call any sizing or positioning APIs before or after the VFL-generated code block; unlike Auto Layout, which ignores any frame you set manually, our VFL code generation tool does not override it.

For example, you can go to the loadView method of a view controller class and add a block like the following, with your VFL specification between the markers:

// VFL begin
// VFL end
Then, run our vfl2objc script. For a layout with a label and a button, it generates code like the following:

    // You need to predefine superview before this.
    CGRect frame;
    frame = label.frame;
    frame.origin.x = 0 + 30;
    frame.size.width = superview.bounds.size.width - (0 + 30) - frame.origin.x;
    frame.origin.y = 0 + 10;
    frame.size.height = 20;
    label.frame = frame;
    label.autoresizingMask |= UIViewAutoresizingFlexibleWidth;
    label.autoresizingMask |= UIViewAutoresizingFlexibleBottomMargin;
    [superview addSubview:label];
    frame = button.frame;
    frame.origin.x = 0 + 50;
    frame.size.width = 100;
    frame.origin.y = 0 + 10 + 20 + 15;
    frame.size.height = 50;
    button.frame = frame;
    button.autoresizingMask |= UIViewAutoresizingFlexibleRightMargin;
    button.autoresizingMask |= UIViewAutoresizingFlexibleBottomMargin;
    [superview addSubview:button];
After running the script, native UI alignment code appears between the VFL begin and VFL end markers. The generated lines are wrapped in curly braces so that you can fold them and look at just the VFL part. If your design changes, you can just edit the VFL definition and run the script again. Of course, you can also trigger the generation process automatically with an Xcode pre-build script.

To learn more about vfl2objc, please visit http://github.com/hulu/vfl2objc

Defining Events

While NSNotification is very handy, it is sometimes bug-prone because of its weak typing. For example, when you post a notification, the compiler cannot check whether the notification name actually matches the name expected by the observers. Also, any information passed along with the notification is either a weakly typed object or a dictionary. To reduce the incidence of bugs, you need to document the notification clearly and keep this documentation up to date as your application changes. Decoupling documentation from implementation often leads to people updating one but not the other, which becomes a source of problems.

To deal with this issue, we introduced a replacement for NSNotification — the HUTypedEvents. The basic idea is to use a protocol, plus a method signature, to represent an event. This allows the compiler to check both the event name and parameters.

For example, suppose we have an event PlaybackStart with integer parameters videoID and showID. If we were to implement it with NSNotification, the code to define it would be:

const NSString *PlaybackStartNotification = @"PlaybackStart";
const NSString *PlaybackStartVideoIDKey = @"PlaybackStartVideoID";
const NSString *PlaybackStartShowIDKey = @"PlaybackStartShowID";
In order to post the event, somewhere inside the playback component we need to call:
[[NSNotificationCenter defaultCenter]
   postNotificationName:PlaybackStartNotification
   object:nil
   userInfo:@{PlaybackStartVideoIDKey: @(videoID),
      PlaybackStartShowIDKey: @(showID)}];
Anyone who needs to handle this event will need to do:
// during init
[[NSNotificationCenter defaultCenter]
   addObserver:self
   selector:@selector(handlePlaybackStart:)
   name:PlaybackStartNotification
   object:nil];
// implement handler
- (void)handlePlaybackStart:(NSNotification *)notification {
    int videoID = [[notification.userInfo
        objectForKey:PlaybackStartVideoIDKey] intValue];
    // some logic based on video ID
}
With HUTypedEvents, this is both simpler and safer. Create a singleton class, for example CentralEventHandler. Declare the event at the beginning of CentralEventHandler.h outside the class interface:
HUDeclareEvent(PlaybackStart, videoID:(int)videoID showID:(int)showID)
In the class interface, add:
HUDeclareEventRegistration(PlaybackStart, videoID:(int)videoID showID:(int)showID)
In the implementation for the class, simply add:
HUImplementEvent(PlaybackStart, videoID:(int)videoID showID:(int)showID)
Then, to trigger this event, call:
[[CentralEventHandler instance] triggerEventPlaybackStart__videoID:videoID showID:showID];
And to handle this event, let the class conform to protocol EventPlaybackStart, and then do:
// during init
[[CentralEventHandler instance] registerEventPlaybackStart__observer:self];
// implement handler
- (void)handleEventPlaybackStart__videoID:(int)videoID showID:(int)showID {
    // logic based on videoID and showID
}
This approach provides the following advantages:

  • When you trigger the event, you don’t have to look up or remember the event name, parameter names, or parameter types.
  • You don’t have to convert primitive types into objects.
  • Whenever you are triggering or handling your event, Xcode will provide autocomplete on the method and parameter names.
  • Whenever you register to receive an event, the compiler will check whether the class conforms to the right protocol, and will ensure that you implemented the right handler method with the right parameters and types.
  • If you want to find every use of this event, both where it’s fired and where it’s handled, you can perform a global code search on “handleEventPlaybackStart”. The search results are much cleaner than when you attempt the same with NSNotification and search for PlaybackStartNotification.

To learn more about HUTypedEvents, please check out http://github.com/hulu/HUTypedEvents

Text Rendering

We’ve upgraded GSFancyText, our open-source rich text rendering framework. We moved all of the markup parsing, line breaking, and alignment to a background thread, and only do the final drawing on the main thread. Since our new application has many scroll views containing lots of images and rich text labels, this new approach is much more performant.

The library’s usage is similar to the older version, except that you need to explicitly call updateDisplay after a GSFancyTextView object is constructed. In case you need to align some other UI based on the size of a GSFancyTextView obtained from the asynchronous calculation, you can use a new method called updateDisplayWithCompletionHandler.

The vfl2objc system, HUTypedEvents, and the updated GSFancyText are just a few examples of integrating little, fun technical improvements into our big iPad app redesign project. During the process of building the Hulu iOS application, we are constantly looking for better ways to make it easier to build a nice looking, maintainable, and performant application. We’re happy to share some of our learnings and code with the wider development community.

Bao Lei is a senior software developer on the mobile team who works on our iOS platform.
