Hulu Tech Blog

Creating Hulu VR

March 24th, 2016 by Editor

By Julian Eggebrecht, Vice President, Device Platforms

Hulu is passionate about the art of great storytelling and evolving that art of storytelling in TV and movies. Hulu sits right at the intersection of entertainment and technology, so it was only natural to explore how Hulu could bring the exciting virtual reality platform to viewers.

Two years ago, a new Hulu division in Marin County, California was founded by a group of video game developers and hardware specialists who had created the Star Wars: Rogue Squadron franchise and the 16-bit classic Turrican games, and who had collaborated with Nintendo and Sony on the GameCube, Wii, and PS3 platform designs.

The team decided to build upon software previously developed using the Unity game engine. The team also had developed video streaming and user interface technology that Hulu has used for years, called CDP. It contains a player that can stream and play almost every video delivery format, combined with a UI system that provides HTML5 and JavaScript for layout and Web developer driven control. CDP also has special rendering features which make fluid 60 frames-per-second UI movement and advanced graphics effects possible. It exists for many devices and operating systems, enabling our developers to create one common app that runs everywhere.

For Hulu VR we deployed CDP not onto a real device sitting in a real living room, but onto a virtual TV located in the Unity game. This requires CDP to be running as a multi-threaded application in parallel to Unity, so that both can exchange their video output and control inputs. The CDP-rendered Hulu UI and video playback is displayed on a polygonal surface in the Unity game environment.

Using a finished game as the basis for our virtual world allows us to use typical game features that are important for immersion, not only presenting a static viewing environment as other apps do. A realtime renderer is used to have dynamic volumetric lighting, dynamic shadows, as well as animated objects. To top it off, we employ real-time physics-simulations to give the user Easter Eggs to discover and play around with.

In addition to playing our existing library in a multitude of virtual environments, the Hulu VR app also takes the plunge into full 360-degree film and video content that can only be truly appreciated in a virtual reality headset. This presented the challenge of streaming full 360-degree videos, some running at 60 frames-per-second and fully stereoscopic with individual frames for each eye. We knew from the outset that the videos would need a resolution close to 4K, but we also wanted to keep the streaming bitrates manageable, considering our first device would be the mobile Samsung GearVR.

Two solutions were chosen to solve these problems. The first was Hulu's first use of the new HEVC video codec, which compresses video roughly twice as efficiently as the older H.264 standard while maintaining the same quality. The second was an optimization of the video data itself. Raw video material for VR is usually delivered in the equirectangular format, the same projection used to unwrap the spherical earth onto the flat, rectangular surface of a map. We developed a new video conversion pipeline for our servers that creates six images per eye in a special layout, which then gets compressed. In the app itself, the viewer sits in the center of a virtual cube, and as each side of the cube is streamed in, the six images are projected into 3D space as cube maps, which gives a perfect illusion of every pixel being an equal distance from the viewer.
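The conversion is essentially a change of projection. As a simplified sketch of that step (this is not Hulu's actual pipeline; the per-eye handling, face layout, and filtering are omitted, and the face orientation convention is illustrative), the following Python snippet samples one cube face from an equirectangular frame:

import numpy as np

def equirect_to_cube_face(equi, face, size):
    """Sample one face of a cube map from an equirectangular frame.

    equi: HxWx3 image in equirectangular projection.
    face: one of '+x', '-x', '+y', '-y', '+z', '-z'.
    size: edge length of the output face in pixels.
    """
    h, w, _ = equi.shape
    a = (np.arange(size) + 0.5) / size * 2.0 - 1.0   # pixel grid in [-1, 1]
    u, v = np.meshgrid(a, -a)                        # v grows upward
    ones = np.ones_like(u)
    # direction vector for every pixel of the requested face
    x, y, z = {'+x': (ones, v, -u), '-x': (-ones, v, u),
               '+y': (u, ones, -v), '-y': (u, -ones, v),
               '+z': (u, v, ones), '-z': (-u, v, -ones)}[face]
    lon = np.arctan2(x, z)                                # longitude in [-pi, pi]
    lat = np.arcsin(y / np.sqrt(x * x + y * y + z * z))   # latitude in [-pi/2, pi/2]
    # map the spherical coordinates back to equirectangular pixel coordinates
    px = ((lon / np.pi + 1.0) / 2.0 * (w - 1)).astype(int)
    py = ((1.0 - (lat / np.pi + 0.5)) * (h - 1)).astype(int)
    return equi[py, px]

The app performs the inverse at playback time: each streamed face is mapped onto the corresponding side of the cube surrounding the viewer.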

Filmmakers say that audio is 50 percent of the viewing experience, and in VR the audio has to be dynamically placed around the listener depending on their head position. We chose multi-channel AAC as our streaming format and use an advanced spatializer solution to position the audio correctly around the listener as their head moves.
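Whichever spatializer implementation is used, the core geometric step is the same: each source direction is re-expressed in the listener's head frame using the inverse of the head rotation. A minimal sketch (axis and handedness conventions here are illustrative, not those of any particular engine):

import numpy as np

def listener_relative_direction(source_pos, head_pos, head_rotation):
    """Direction of an audio source in the listener's head frame.

    head_rotation: 3x3 matrix rotating head-local axes into world axes,
    e.g. derived from the headset's orientation quaternion.
    """
    world_dir = np.asarray(source_pos, float) - np.asarray(head_pos, float)
    world_dir /= np.linalg.norm(world_dir)
    # the inverse of a rotation matrix is its transpose
    return head_rotation.T @ world_dir

# a source straight ahead in world space ends up off to the listener's side
# once the head is turned 90 degrees
yaw90 = np.array([[0.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0],
                  [-1.0, 0.0, 0.0]])
print(listener_relative_direction([0.0, 0.0, 1.0], [0.0, 0.0, 0.0], yaw90))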

For the GearVR, we also took platform-specific steps to make our UI and the video content look as good as possible. An Oculus-specific feature called “Overlay” lets us sidestep some of the resampling that is applied to the 3D scene to compensate for the lenses of the headset. Using the Overlay feature, we are able to stream and display Hulu’s regular library in our virtual environments in 720p HD resolution, and make monoscopic VR videos look much more refined.

Near the end of the development cycle for the first version of the Hulu VR app, two issues needed resolution: performance and heat. To render the UI at speed, as the viewer pans across the room and gets close to the Hulu UI, a custom depth-of-field shader coupled with a post-processing pass blurs the screen. While that happens, the room is rendered as a simple image and blurred; once this impostor is in place and the user looks at the UI, the 3D world behind it is switched off. Other performance tricks involve dynamic shadows, which are only updated in areas the user looks at, and only when objects animate or are being manipulated by the user. Post-processing for HDR is only applied to certain parts of the 3D scene, and we switch off anti-aliasing during VR playback to reduce system bus traffic. We also take full control of the CPU and GPU clock rates, running the CPU as low as possible to keep the heat down. When the user scrolls within the UI or starts playing with physics-driven objects, we throttle the CPU up to provide enough power to sustain good frame rates.

With the first version of the Hulu VR app finished, we plan to add new VR shorts, films, and fantastic environments with new gadgets on a regular basis. The advantage of a cloud-based service is that shipping the first version of an app is only the foundation, and new content can be added at any time going forward. In the case of Hulu VR, this will mean new 2D shows, originals, movies, and trailers. We will add new 3D environments to watch this content in, and most excitingly more and more 360 degree VR filmed content from a wide variety of creators.

For now, we hope you enjoy this peek into the future – and we will get back to work on the next evolution of Hulu VR – which we are also hiring for at Hulu Marin, so please join the team!

Monaco – Efficiently Hosting 1.5TB of Redis Servers

October 6th, 2015 by Keith Ainsworth

Redis has become an indispensable technology at Hulu as we continue to scale and innovate new features and products. Redis is a lightning-fast in-memory key-value store with a very light hardware footprint, which makes it ideal for building new projects. It was so ideal that we found dozens of virtual machines (VMs) and bare-metal servers in our infrastructure dedicated entirely to hosting a redis-server.

Clearly Redis is an easy choice from a programmer’s perspective, but because it is so efficient, most of the Redis-dedicated hardware was unfortunately horrendously underused. CPU load and disk I/O were the most commonly underused resources, with network I/O generally being well within limits as well.

To make the under-utilization worse, we need to claim even more hardware to guarantee high availability. Although Redis has made huge strides lately in high availability (HA), you must still set up some form of automated master-slave failover in order to build a robust service on top of Redis. At Hulu, robustness and sensible degradation are among the foremost areas of focus on projects, so naturally there were several such solutions in use (including Redis Sentinel).

Initially, it was easy to maintain multiple clusters and the total amount of hardware used wasn’t too excessive. But every new project using Redis meant another cluster to maintain. Fast-forward a few years and we find these HA solutions are not only growing more numerous and burdensome to maintain, but that we have a lot of underutilized hardware. The common denominator: Redis.

 

Enter Monaco

 

We use Redis for a lot of services, so it was time we organized how it is used. Monaco was built to provide that organization. Monaco is a clustered service that deploys, configures, and maintains highly available redis-servers across a cluster of hardware. By providing a robust, distributed management layer, Monaco can fill the machines to capacity with redis-servers in a predictable manner, allowing us to utilize much more of the resources in the underlying machines.

In simplest terms, Monaco is Redis as a Service.

Because Monaco provides automation for all aspects of creating, maintaining, and monitoring Redis servers, our Developers now use Monaco when they need Redis for projects. As Monaco’s reliability was tested, more and more existing services migrated their Redis usage into Monaco as well. Today, we have 1.5TB of managed redis-servers in Monaco and are growing that number every day. Additionally, we have our hardware filled to our (somewhat conservative) desired limits. We’re not just cutting down on VM overhead – using Monaco has ensured that utilization of our Redis-hosting hardware is consistent and maximized.

 

How it Works

 

Before I describe how Monaco works, note that we’ve open-sourced this project so you can dive in as deep as you like: http://github.com/hulu/monaco (I’ll wait, go take a look).

At its heart, Monaco is based upon a Redis cluster itself. It uses Redis to distribute its internal state, communicate amongst nodes, and detect state changes. Monaco, very similarly to redis-sentinel, uses a leader election to maintain that redis-server cluster’s replication roles, but that’s where the similarities end.

Using the cluster’s Redis replication roles as our basis, all Monaco nodes have a role of either “master” or “slave.” Using a proven consensus algorithm (Raft) as the basis for leader election, we can assume this state to be consistent within our cluster. For the sake of clarity, I will refer to the Redis DB that Monaco uses as the “Monaco DB,” to differentiate it from the Redis servers that Monaco hosts. The master node is responsible for monitoring the health of the cluster, as well as the hosted redis-servers maintained in the cluster. Additionally, this master node can create/edit/delete hosted redis-servers in the cluster. Using keyspace notifications on the Monaco DB, all nodes instantly know about any change of state they need to enact.
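As a minimal sketch of that mechanism with redis-py (against a plain Redis instance; the key pattern below is illustrative, not Monaco's actual schema), a node can subscribe to keyspace notifications and react as soon as the master writes new state:

import redis

r = redis.Redis(host="localhost", port=6379)
# make Redis emit keyspace events (K = keyspace channel, A = all event classes)
r.config_set("notify-keyspace-events", "KA")

p = r.pubsub()
# watch every key in DB 0; a real deployment would use a narrower, well-known prefix
p.psubscribe("__keyspace@0__:*")

for message in p.listen():
    if message["type"] != "pmessage":
        continue  # skip subscription confirmations
    key = message["channel"].decode().split(":", 1)[1]
    event = message["data"].decode()          # e.g. "set", "del", "hset"
    print("state change on %s: %s" % (key, event))  # a node would re-read its assignments here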

All master/slave nodes are responsible for monitoring the Monaco DB and maintaining their node’s state. Subscriptions in the local Monaco DB inform the Monaco process of any relevant work. Additionally, each “slave” node reports back their hosted redis-servers’ status, so the master can perform any failover as necessary.

Finally, to provide an interface, there’s a web application, which can run on any subset of the nodes in a Monaco cluster. It provides a simple interface with explanations of all Monaco API functions. In order to populate the graphs in the interface, there’s a statistic-reporting Daemon that’s packaged with the open source release. That Daemon uses the Monaco DB to store recent stats. However, for large installs, it is most likely not adequate as your Monaco DB could grow quite large and slow (especially with all the replication in a large cluster).

 

Usage

 

Monaco provides a bunch of Redis clusters within its hardware, but now what? How do your Redis clients contact it in a robust manner? The answers will vary based on implementation, as Monaco has a few modular components.

The “exposure” module defines how Monaco’s master will attempt to inform your infrastructure of changes in Monaco-hosted Redis clusters. Consider this a hook that is triggered by Monaco state changes. At Hulu, we’ve defined this module to create/edit/delete loadbalancer configurations, allowing our Monaco DBs to be constantly available through a Virtual IP.

The “stat” module defines how the Monaco stat reporter publishes stats. By default it uses the Monaco DB, but at Hulu we use a Graphite stats publisher to integrate better with our service monitoring pipeline. Clearly, the exposure module is the more interesting one (although large deployments will have to write a stat module, even if it’s just to drop the stats). Because of our particular exposure implementation, we’re able to use simple Redis clients with no HA logic baked into the client side, just retries on disconnect, as sketched below.
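A minimal sketch of such a client with redis-py, assuming a load-balanced virtual IP in front of the Monaco-hosted Redis (the host name below is made up):

import time
import redis

class RetryingRedis:
    """Thin wrapper that reconnects and retries while a failover completes."""

    def __init__(self, host, port, retries=5, backoff=0.5):
        self.retries = retries
        self.backoff = backoff
        self.client = redis.Redis(host=host, port=port, socket_timeout=2)

    def execute(self, command, *args):
        for attempt in range(self.retries):
            try:
                return self.client.execute_command(command, *args)
            except (redis.ConnectionError, redis.TimeoutError):
                if attempt == self.retries - 1:
                    raise
                time.sleep(self.backoff * (attempt + 1))  # give the failover time to finish

cache = RetryingRedis("monaco-db-42.example.com", 6379)
cache.execute("SET", "greeting", "hello")
print(cache.execute("GET", "greeting"))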

However, the default exposure module does nothing – infrastructure varies widely and it wouldn’t be possible to create a module that would work in all infrastructures. But in the event you don’t want to implement a custom exposure module, you can use Monaco reliably with one web API call. All cluster information is available through the web API in JSON format, which allows for more complex use cases such as splitting reads/writes across the master/slave nodes.
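If you go the web-API route instead, the flow is simply: fetch the cluster description as JSON, pick out the current master, and connect. The endpoint path and JSON fields below are hypothetical; check the open source release for the real API:

import json
import urllib.request

import redis

# hypothetical endpoint and fields; consult the Monaco web app for the actual API
resp = urllib.request.urlopen("http://monaco-web.example.com/api/db/42")
info = json.loads(resp.read())

master = info["master"]                      # e.g. {"host": "...", "port": 6379}
client = redis.Redis(host=master["host"], port=master["port"])
print(client.ping())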

 

Redis Conf 2015

 

Monaco has been under development for a while, and it was a treat to share it with the Redis community at RedisConf 2015. Towards the end of the conference, Salvatore Sanfilippo (the creator of Redis) addressed the audience about open source software, and that really motivated us to give back to the open source community. For those interested in more of the motivations and architecture behind Monaco, here’s a link to the talk: http://redisconference.com/video/monaco/

 

DNS infrastructure at Hulu

September 8th, 2015 by Kirill Timofeev

Introduction

For companies running their own datacenter, setting up internal DNS infrastructure is essential for performance and ease of maintenance. Setting up a single DNS server for occasional requests is pretty straightforward, but scaling and distributing requests across multiple data centers is challenging. In this post, we’ll describe the evolution of our DNS infrastructure from a simple setup to a more distributed configuration that is capable of reliably handling a significantly higher request volume.

DNS primer

In order to talk to each other over a network, computers are assigned IP addresses (http://en.wikipedia.org/wiki/IP_address). An IP address can be either IPv4 (e.g. 192.168.1.100) or IPv6 (e.g. fe80::3e97:eff:fe3a:ef7c). Humans are not that good at memorizing IP addresses, so the Domain Name System (DNS) was created (http://en.wikipedia.org/wiki/Domain_Name_System). DNS translates human-readable domain names to IP addresses. This allows you to type www.hulu.com in your browser to watch your favorite show instead of typing http://165.254.237.137. DNS can be used to resolve different types of names. The most important record types are:

  • A: maps a name to an IPv4 address
  • AAAA: maps a name to an IPv6 address
  • CNAME: an alias pointing to another name
  • PTR: maps an IP address back to a name (reverse lookup)

The DNS response contains a response code. A few important examples are listed below, followed by a short lookup example:

  • NOERROR: the name was resolved successfully
  • NXDOMAIN: the requested name does not exist
  • NODATA: the name exists, but has no records of the requested type (it may have records of other types)
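For example, with the dnspython package (an assumption here; any resolver library exposes the same three outcomes), a successful lookup returns data, while NXDOMAIN and NODATA surface as distinct exceptions:

import dns.resolver  # pip install dnspython

def lookup(name, rtype="A"):
    try:
        answers = dns.resolver.resolve(name, rtype)
        return [str(r) for r in answers]      # NOERROR with data
    except dns.resolver.NXDOMAIN:
        return "NXDOMAIN: the name does not exist"
    except dns.resolver.NoAnswer:
        return "NODATA: the name exists but has no records of this type"

print(lookup("www.hulu.com", "A"))
print(lookup("does-not-exist.hulu.com", "A"))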

A DNS zone is any distinct, contiguous portion of the domain name space in the Domain Name System (DNS) for which administrative responsibility has been delegated to a single manager. The domain name space of the Internet is organized into a hierarchical layout of subdomains below the DNS root domain.

DNS servers will be referred to as nameservers throughout the rest of this document. Nameservers are recursive if they forward a request further when they can’t answer it themselves, or authoritative if they answer a given query without additional requests. Recursive DNS servers are called recursors.

For this document, we’ll call servers that send DNS requests to nameservers clients. Linux uses a resolv.conf file located at /etc/resolv.conf that contains DNS configuration, including the list of nameservers to use for DNS requests.

Initial configuration

Initially we had a very straightforward DNS infrastructure configuration:

  • 4 nameservers in each datacenter behind a load balancer
  • Each nameserver was running a PowerDNS (https://www.powerdns.com/) authoritative service and a recursor service
  • The recursor service was configured to serve internal zones using the PowerDNS authoritative server on the same host
  • Every client had a resolv.conf configured with the load balancer's DNS virtual IP address and the IP addresses of 2 backend nameservers. If a request sent through the load-balanced endpoint timed out, the request would be retried directly against the nameservers.

This worked, but DNS uses UDP, which does not guarantee delivery. While the overall DNS infrastructure worked OK, names sometimes failed to resolve due to network issues or excessive traffic. After seeing these occasional failures, service owners tended to replace names with IP addresses in their configurations. This was especially true for services with extremely high request rates.

On the nameservers.

/etc/powerdns/pdns.conf:

default-soa-name=a.prd.hulu.com
launch=gsqlite3
gsqlite3-database=/opt/dns/sqlite_db
local-port=5300
log-failed-updates=yes
/etc/powerdns/recursor.conf:

allow-from=127.0.0.0/8, 192.168.1.0/24, 10.0.0.0/8, 1.2.3.0/24
forward-zones=10.in-addr.arpa=127.0.0.1:5300,prd.hulu.com=127.0.0.1:5300,stg.hulu.com=127.0.0.1:5300,tst.hulu.com=127.0.0.1:5300
 # /etc/powerdns/empty_zone is an empty file
auth-zones=hulu.com.prd.hulu.com=/etc/powerdns/empty_zone,hulu.com.stg.hulu.com=/etc/powerdns/empty_zone,hulu.com.tst.hulu.com=/etc/powerdns/empty_zone
local-address=0.0.0.0
log-common-errors=yes
max-negative-ttl=60
threads=2

On the clients

/etc/resolv.conf:

nameserver 10.1.1.53 # dns ip on load balancer
nameserver 10.1.1.120 # nameserver behind load balancer
nameserver 10.1.1.121 # another nameserver behind load balancer
search prd.hulu.com stg.hulu.com tst.hulu.com

Adding local DNS caching

In order to minimize the chance of DNS resolution failure because of network issues, we decided to setup a local DNS cache on each client. We compared several popular DNS caches such as unbound, pdnsd, PowerDNS recursor and nscd. Unbound demonstrated the best performance, excellent stability and ease of configuration, so we gradually rolled it out to all clients.

Initially, resolv.conf pointed to the load-balanced DNS nameservers. We updated it to list unbound first, followed by the load-balanced nameserver in case unbound failed. It is important to note that if you have multiple nameserver entries in resolv.conf, by default they are used in the order they are listed, with 2 retries and a 5-second timeout (see man resolv.conf for more details). This means that each failing nameserver in resolv.conf adds a 10-second delay to name resolution. However, this is not the case for a local nameserver running on 127.0.0.1: the resolver detects almost instantaneously that UDP port 53 on localhost is not listening and tries the next nameserver in the list, so the delay is negligible. It is also worth noting that Python, Ruby, Java, and Node.js cache the nameserver IPs from resolv.conf, so services written in those languages can see resolution delays when the local unbound nameserver is restarted.

Results were very good, traffic decreased and there were no failures while resolving DNS names.

On the clients.

/etc/resolv.conf:

nameserver 127.0.0.1 # unbound
nameserver 10.1.1.53 # dns ip on load balancer
search prd.hulu.com stg.hulu.com tst.hulu.com
/etc/unbound/unbound.conf:

server:
  access-control: 0.0.0.0/0 allow
  prefetch: yes
  do-ip6: no
  local-zone: "10.in-addr.arpa." transparent
remote-control:
  control-enable: no
forward-zone:
  name: "."
  forward-addr: 10.1.1.53

Blocking AAAA queries

We noticed that a significant portion of responses were NXDOMAIN, for names like host.prd.hulu.com.tst.hulu.com. This had two causes: the search list in resolv.conf and AAAA queries from Python. We don’t use IPv6 in our internal network, so AAAA queries only added unnecessary traffic and increased the chance that a valid request would go unanswered, since DNS runs over UDP and UDP does not guarantee delivery. After a close look at unbound’s features, we blocked most of the bad queries by creating zones that unbound serves locally.

On the clients.

/etc/unbound/unbound.conf:

server:
  access-control: 0.0.0.0/0 allow
  prefetch: yes
  do-ip6: no
  local-zone: "com.prd.hulu.com" static
  local-zone: "com.stg.hulu.com" static
  local-zone: "com.tst.hulu.com" static
  local-zone: "10.in-addr.arpa." transparent
remote-control:
  control-enable: no
forward-zone:
  name: "."
  forward-addr: 10.1.1.53

Making unbound daemons talk directly to nameservers

Everything was working OK until we had a failure in the load balancer we were using for DNS. Services were not able to resolve DNS names and started to fail. Given how important DNS is, we decided that it should not depend on the reliability of a load balancer, so we removed the load balancer from the DNS infrastructure. This was easy to accomplish: instead of pointing to the load balancer VIP, we reconfigured unbound to talk directly to the nameservers.

On the clients.

/etc/unbound/unbound.conf:

server:
  access-control: 0.0.0.0/0 allow
  prefetch: yes
  do-ip6: no
  local-zone: "com.prd.hulu.com" static
  local-zone: "com.stg.hulu.com" static
  local-zone: "com.tst.hulu.com" static
  local-zone: "10.in-addr.arpa." transparent
remote-control:
  control-enable: no
forward-zone:
  name: "."
  forward-addr: 10.1.1.120
  forward-addr: 10.1.1.121
  forward-addr: 10.1.1.122
  forward-addr: 10.1.1.123

Making DNS resolution immune to internet connection outages

Some time after that, we had an internet connection issue affecting outbound traffic in one of our datacenters. It is important to note that authoritative DNS for our externally facing DNS names is hosted by our CDN, and resolution of these names from inside our datacenters also goes through the CDN. We do this in order to have unified DNS resolution from the internal Hulu network and from the internet (from Hulu’s customers’ perspective). This unified approach allows us to easily reproduce any customer-reported issues related to DNS. But it also means that without an internet connection we can’t resolve our external DNS names, even when the IPs they resolve to physically reside in our datacenters. Because names couldn’t be resolved, many services in this datacenter stopped working even though they had no dependency on the outside internet. We decided to modify DNS so that a failure of the internet connection in a single datacenter wouldn’t affect DNS. This was a significant change to the nameservers. What we did:

  • The original PowerDNS recursor service was retired, and we switched to unbound
  • Incoming DNS requests to the nameservers were served by unbound, configured with the appropriate zones
  • Requests for internal zones were forwarded to the PowerDNS authoritative services. Each unbound instance talked to all PowerDNS servers, so the failure of one PowerDNS authoritative server carried only a small performance penalty
  • External requests were forwarded to another unbound layer. When the internet connection worked, this layer talked to the unbound instances in the local datacenter. If the outbound internet connection failed, the unbound instances from the other datacenter were added, so that external names could continue to be resolved. It is important to note that using unbound recursors from another datacenter can produce wrong geo-location results for external names: instead of an address close to the datacenter where the request originated, an address close to the datacenter with a working internet connection is returned. This can increase round-trip times and service latencies

On the nameservers.

/etc/unbound/unbound.conf:

server:
  interface: 0.0.0.0
  access-control: 0.0.0.0/0 allow
  prefetch: yes
  rrset-roundrobin: yes
  do-ip6: no
  do-not-query-localhost: no
  extended-statistics: yes
  local-zone: "com.prd.hulu.com" static
  local-zone: "com.stg.hulu.com" static
  local-zone: "com.tst.hulu.com" static
  local-zone: "10.in-addr.arpa." transparent
forward-zone:
  name: "10.in-addr.arpa"
  forward-addr: 10.1.1.120@5300
  forward-addr: 10.1.1.121@5300
  forward-addr: 10.1.1.122@5300
  forward-addr: 10.1.1.123@5300
forward-zone:
  name: "prd.hulu.com"
  forward-addr: 10.1.1.120@5300
  forward-addr: 10.1.1.121@5300
  forward-addr: 10.1.1.122@5300
  forward-addr: 10.1.1.123@5300
forward-zone:
  name: "stg.hulu.com"
  forward-addr: 10.1.1.120@5300
  forward-addr: 10.1.1.121@5300
  forward-addr: 10.1.1.122@5300
  forward-addr: 10.1.1.123@5300
forward-zone:
  name: "tst.hulu.com"
  forward-addr: 10.1.1.120@5300
  forward-addr: 10.1.1.121@5300
  forward-addr: 10.1.1.122@5300
  forward-addr: 10.1.1.123@5300
forward-zone:
  name: "."
  forward-addr: 10.1.1.120@5301
  forward-addr: 10.1.1.121@5301
  forward-addr: 10.1.1.122@5301
  forward-addr: 10.1.1.123@5301
/etc/unbound/unbound-5301.conf:

server:
  interface: 0.0.0.0
  port: 5301
  access-control: 0.0.0.0/0 allow
  prefetch: yes
  do-ip6: no
  extended-statistics: yes
  pidfile: /var/run/unbound-5301.pid
remote-control:
  control-port: 8954
/opt/dns/bin/dns-monitor.sh:

#!/bin/bash

set -u
# nameserver IPs in each datacenter
dc1_ips="10.1.1.120 10.1.1.121 10.1.1.122 10.1.1.123"
dc2_ips="10.2.1.120 10.2.1.121 10.2.1.122 10.2.1.123"

case "$(hostname -s|cut -f1 -d-)" in
    dc1) here_ips=$dc1_ips; there_ips=$dc2_ips ;;
    dc2) here_ips=$dc2_ips; there_ips=$dc1_ips ;;
    * ) exit 1 ;;
esac

PID_FILE=/var/run/$(basename $(readlink -f $0)).pid
all_ips=$(echo $dc1_ips $dc2_ips|sed -e 's/ /\n/g'|sort)

# succeeds if any of the given upstreams can resolve an external name
check_upstream() {
    for ip; do
        [ "$(dig @$ip -p 5301 +tries=1 +time=1 +short google-public-dns-a.google.com)" == "8.8.8.8" ] && return 0
    done
    return 1
}

# point unbound's "." forward zone at the given upstreams (external resolvers on port 5301)
set_zone() {
    forwarders=()
    for ip; do
        forwarders+=(${ip}@5301)
    done
    /usr/sbin/unbound-control forward_add . ${forwarders[@]}
}

# use local-datacenter forwarders while they work; fall back to both datacenters otherwise
run_check() {
    current_zone=$(/usr/sbin/unbound-control list_forwards|grep '^. '| \
        sed -e 's/.*forward: //' -e 's/ /\n/g'|sort)
    here_status=down
    there_status=down
    cross_dc_status=down
    check_upstream $here_ips && here_status=up
    [ "$current_zone" == "$all_ips" ] && cross_dc_status=up
    [ "${here_status}${cross_dc_status}" == "upup" ] && {
        set_zone $here_ips
        return
    }
    check_upstream $there_ips && there_status=up
    [ "${here_status}${cross_dc_status}${there_status}" == "downdownup" ] && {
        set_zone $all_ips
        return
    }
}

# pid-file lock so only one instance of the monitor runs at a time
get_lock() {
    touch ${PID_FILE}.test || {
        echo "Can't create ${PID_FILE}"
        exit 1
    }
    rm -f ${PID_FILE}.test
    while true; do
        set -- $(LC_ALL=C ls -il ${PID_FILE} 2>/dev/null)
        if [ -z "${1:-}" ] ; then
            ln -s $$ $PID_FILE && return 0
        else
            ps ${12} >/dev/null 2>&1 && return 1
            find $PID_FILE -inum ${1} -exec rm -f {} \;
        fi
    done
}

get_lock || exit 1
exec &>/dev/null
while sleep 5; do
    run_check
done
/etc/cron.d/dns-monitor:

SHELL=/bin/bash
* * * * * root /opt/dns/bin/dns-monitor.sh

Serving programmatically generated names

In order to test certain services, we needed a testing host to have a name in the *.hulu.com domain. When testing is done from a workstation, it is not always possible to use a dedicated name. We decided to use a special domain, ip.hulu.com: names of the form A.B.C.D.ip.hulu.com resolve to the IP address A.B.C.D. This can be done using unbound's Python extension. Another useful feature that can be implemented in the Python extension is datacenter-aware names. For example, if we have service.dc1.prd.hulu.com in datacenter dc1 and service.dc2.prd.hulu.com in dc2, we can have a virtual domain dc.prd.hulu.com so that service.dc.prd.hulu.com resolves to the proper name local to the datacenter.

On nameserver:

/etc/unbound/unbound.conf:

server:
  interface: 0.0.0.0
  access-control: 0.0.0.0/0 allow
  prefetch: yes
  rrset-roundrobin: yes
  do-ip6: no
  do-not-query-localhost: no
  extended-statistics: yes
  local-zone: "com.prd.hulu.com" static
  local-zone: "com.stg.hulu.com" static
  local-zone: "com.tst.hulu.com" static
  local-zone: "10.in-addr.arpa." transparent
forward-zone:
  name: "10.in-addr.arpa"
  forward-addr: 10.1.1.120@5300
  forward-addr: 10.1.1.121@5300
  forward-addr: 10.1.1.122@5300
  forward-addr: 10.1.1.123@5300
forward-zone:
  name: "prd.hulu.com"
  forward-addr: 10.1.1.120@5300
  forward-addr: 10.1.1.121@5300
  forward-addr: 10.1.1.122@5300
  forward-addr: 10.1.1.123@5300
forward-zone:
  name: "stg.hulu.com"
  forward-addr: 10.1.1.120@5300
  forward-addr: 10.1.1.121@5300
  forward-addr: 10.1.1.122@5300
  forward-addr: 10.1.1.123@5300
forward-zone:
  name: "tst.hulu.com"
  forward-addr: 10.1.1.120@5300
  forward-addr: 10.1.1.121@5300
  forward-addr: 10.1.1.122@5300
  forward-addr: 10.1.1.123@5300
forward-zone:
  name: "dc.prd.hulu.com"
  forward-addr: 10.1.1.120@5301
  forward-addr: 10.1.1.121@5301
  forward-addr: 10.1.1.122@5301
  forward-addr: 10.1.1.123@5301
forward-zone:
  name: "dc.stg.hulu.com"
  forward-addr: 10.1.1.120@5301
  forward-addr: 10.1.1.121@5301
  forward-addr: 10.1.1.122@5301
  forward-addr: 10.1.1.123@5301
forward-zone:
  name: "dc.tst.hulu.com"
  forward-addr: 10.1.1.120@5301
  forward-addr: 10.1.1.121@5301
  forward-addr: 10.1.1.122@5301
  forward-addr: 10.1.1.123@5301
forward-zone:
  name: "."
  forward-addr: 10.1.1.120@5301
  forward-addr: 10.1.1.121@5301
  forward-addr: 10.1.1.122@5301
  forward-addr: 10.1.1.123@5301
/etc/unbound/unbound-5301.conf:

server:
  interface: 0.0.0.0
  port: 5301
  access-control: 0.0.0.0/0 allow
  prefetch: yes
  do-ip6: no
  extended-statistics: yes
  pidfile: /var/run/unbound-5301.pid
  module-config: "validator python iterator"
python:
  python-script: "/etc/unbound/unbound-5301.py"
remote-control:
  control-port: 8954
/etc/unbound/unbound-5301.py:

DC = "dc1" # or dc2 depending which datacenter we are in

def init(id, cfg): return True
def deinit(id): return True
def inform_super(id, qstate, superqstate, qdata): return True
def create_response(id, qstate, in_rr_types, out_rr_type, pkt_flags, msg_answer_append):
    # create instance of DNS message (packet) with given parameters
    msg = DNSMessage(qstate.qinfo.qname_str, out_rr_type, RR_CLASS_IN, pkt_flags)
    # append RR
    if qstate.qinfo.qtype in in_rr_types:
        msg.answer.append(msg_answer_append)
    # set qstate.return_msg
    if not msg.set_return_msg(qstate):
        qstate.ext_state[id] = MODULE_ERROR
        return True
    # we don't need validation, result is valid
    qstate.return_msg.rep.security = 2
    qstate.return_rcode = RCODE_NOERROR
    qstate.ext_state[id] = MODULE_FINISHED
    return True

def operate(id, event, qstate, qdata):
    if (event == MODULE_EVENT_NEW) or (event == MODULE_EVENT_PASS):
        a = qstate.qinfo.qname_str.split('.')
        if len(a) > 5 and a[-1] == '' and a[-2] == 'com' and a[-3] == 'hulu':
            # A.B.C.D.ip.hulu.com resolves to the address A.B.C.D
            if len(a) == 8 and a[-4] == 'ip' \
                    and 0 <= int(a[-5]) <= 255 \
                    and 0 <= int(a[-6]) <= 255 \
                    and 0 <= int(a[-7]) <= 255 \
                    and 0 <= int(a[-8]) <= 255:
                msg_answer_append = "{0} 300 IN A {1}.{2}.{3}.{4}".format(
                    qstate.qinfo.qname_str, a[-8], a[-7], a[-6], a[-5])
                create_response(id, qstate, [RR_TYPE_A, RR_TYPE_ANY], RR_TYPE_A,
                                PKT_QR | PKT_RA | PKT_AA, msg_answer_append)
                return True
            # names with the virtual 'dc' label are answered with a CNAME to the local datacenter
            if a[-5] == 'dc':
                a[-5] = DC
                msg_answer_append = "{0} 300 IN CNAME {1}".format(
                    qstate.qinfo.qname_str, '.'.join(a))
                create_response(id, qstate, [RR_TYPE_CNAME, RR_TYPE_A, RR_TYPE_ANY],
                                RR_TYPE_CNAME, PKT_QR | PKT_RA, msg_answer_append)
                return True
        else:
            # pass the query to validator
            qstate.ext_state[id] = MODULE_WAIT_MODULE
            return True

    if event == MODULE_EVENT_MODDONE:
        log_info("pythonmod: iterator module done")
        qstate.ext_state[id] = MODULE_FINISHED
        return True

    log_err("pythonmod: bad event")
    qstate.ext_state[id] = MODULE_ERROR
    return True

Early blocking of unwanted queries

After more research, we figured out that blocking of unwanted queries can be done early with unbound extension. Specifically in our case we wanted to block resolving of all IPv6 names from Hulu domain (since we are not using IPv6 in our internal network). Here is how this can be done:

On the clients.

/etc/unbound/unbound.conf:

server:
  access-control: 0.0.0.0/0 allow
  prefetch: yes
  do-ip6: no
  rrset-roundrobin: yes
  chroot: ""
  local-zone: "com.prd.hulu.com" static
  local-zone: "com.stg.hulu.com" static
  local-zone: "com.tst.hulu.com" static
  local-zone: "10.in-addr.arpa." transparent
  module-config: "validator python iterator"
python:
  python-script: "/etc/unbound/unbound.py"
remote-control:
  control-enable: no
forward-zone:
  name: "."
  forward-addr: 10.1.1.120
  forward-addr: 10.1.1.121
  forward-addr: 10.1.1.122
  forward-addr: 10.1.1.123
/etc/unbound/unbound.py:

def init(id, cfg): return True

def deinit(id): return True
def inform_super(id, qstate, superqstate, qdata): return True
def operate(id, event, qstate, qdata):
    if (event == MODULE_EVENT_NEW) or (event == MODULE_EVENT_PASS):
        qtype = qstate.qinfo.qtype
        qname_str = qstate.qinfo.qname_str
        if (qtype == RR_TYPE_AAAA and qname_str.endswith(".hulu.com.")):
            # create instance of DNS message (packet) with given parameters
            msg = DNSMessage(qname_str, qtype, RR_CLASS_IN, PKT_QR | PKT_RA | PKT_AA)
            # set qstate.return_msg
            if not msg.set_return_msg(qstate):
                qstate.ext_state[id] = MODULE_ERROR
                return True
            # we don't need validation, result is valid
            qstate.return_msg.rep.security = 2
            qstate.return_rcode = RCODE_NXDOMAIN
            qstate.ext_state[id] = MODULE_FINISHED
            return True
        else:
            # pass the query to validator
            qstate.ext_state[id] = MODULE_WAIT_MODULE
            return True

    if event == MODULE_EVENT_MODDONE:
        # log_info("pythonmod: iterator module done")
        qstate.ext_state[id] = MODULE_FINISHED
        return True

    log_err("pythonmod: bad event")
    qstate.ext_state[id] = MODULE_ERROR
    return True

Cleanup of search list in resolv.conf

Finally, we removed the search list from /etc/resolv.conf on the clients. A search list is useful when you want to use short names instead of fully qualified domain names (e.g. myservice instead of myservice.prd.hulu.com). But this is a bad practice: if both myservice.prd.hulu.com and myservice.stg.hulu.com exist, a short name resolves to whichever domain comes first in the search list. Instead of a search list, we now use the domain directive in resolv.conf, which still allows short names but is limited to a single domain, so there is no ambiguity in name resolution.

On the clients.

/etc/resolv.conf:

nameserver 127.0.0.1 # unbound
nameserver 10.1.1.53 # dns ip on load balancer
domain prd.hulu.com

DNS naming conventions

We found that once a DNS name starts being used, it is extremely hard to deprecate. Consequently, it’s best to think carefully about naming schemas and follow them rigorously. We found that it’s generally a good idea to have domains specific to production, testing, etc:

  • prd.company.com
  • tst.company.com
  • dev.company.com

If you have multiple data centers it may be useful to have data center specific domains:

  • dc0.company.com
  • dc1.company.com

Using high-level names like git.company.com, help.company.com, etc. for internal services seems like a good idea for keeping names short and saving on typing, but supporting them can become quite complicated over time. It's often better to use the environment-specific domains for these as well (e.g. git.prd.company.com).

Closing thoughts

In the first portion of our upgrade, we added local caches that talk directly to the nameservers and proxy traffic between datacenters. This immediately minimized the effect of internal or external network outages on name resolution. Once we had a more stable setup, we increased the performance of the overall system by eliminating a significant number of irrelevant queries. We're continuing to improve our DNS infrastructure by adding programmatically generated names that allow services in a datacenter to automatically find their counterpart services and databases. We've been very happy with our new setup and hope that the details we've shared here prove useful to others looking to scale up their own DNS infrastructure.

Voidbox – Docker on YARN

August 6th, 2015 by Huahui Yang

1. Voidbox Motivation

YARN is the distributed resource management system in Hadoop 2.0, able to schedule cluster resources for diverse high-level applications such as MapReduce and Spark. However, existing frameworks on top of YARN are designed with assumptions about a specific system environment. How to support user applications with arbitrarily complex environment dependencies is still an open question. Docker gives the answer.

Docker is a very popular container virtualization technology. It provides a way to run almost any application isolated in a container. Docker is an open platform for developing, shipping, and running applications. Docker automates the deployment of any application as a lightweight, portable, self-sufficient container that will run virtually anywhere.

In order to combine the unique advantages of Docker and YARN, the Hulu engineering team developed Voidbox. Voidbox enables any application encapsulated in a Docker image to run on a YARN cluster alongside MapReduce and Spark. Voidbox brings the following benefits:

  • Easier creation of distributed applications
    • Voidbox handles the most common issues in distributed computation systems, such as cluster discovery, elastic resource allocation, task coordination, and disaster recovery. With its well-designed interface, it's easy to implement a distributed application.
  • Simpler deployment
    • Without Voidbox, we needed to create and maintain a dedicated VM for every application with a complex environment, even though the VM images were huge and not easy to deploy. With Voidbox, we can easily get resources allocated and have the app running right when we need it. Additional maintenance work is eliminated.
  • Improved cluster efficiency
    • Because Spark/MapReduce jobs and all kinds of Voidbox applications from different departments are deployed together, we can maximize cluster usage.

Thus, YARN as a big data operating platform has been further consolidated and enhanced.

Voidbox supports executing Docker container-based DAG (Directed Acyclic Graph) tasks. Moreover, Voidbox provides several ways to submit applications, addressing the demands of both production and debugging environments. In addition, Voidbox can cooperate with Jenkins, GitLab, and a private Docker Registry to set up a development, testing, and automated release pipeline.

2. Voidbox Architecture

2.1 YARN Architecture Overview

YARN enables multiple applications to share resources dynamically in a cluster. Here is the architecture of applications running in YARN cluster:


Figure 1. YARN Architecture

As shown in figure 1, a client submits a job to the Resource Manager. The Resource Manager performs its scheduling function according to the resource requirements of the application. The Application Master is responsible for scheduling the application's tasks and managing the application's lifecycle.

Functionality of each module:

  • Resource Manager: Responsible for resource management and scheduling in the cluster.
  • NodeManager: Runs on the compute nodes in the cluster, takes care of task execution on the individual machine, collects information, and keeps a heartbeat with the Resource Manager.
  • Application Master: Takes care of requesting resources from YARN, then allocates resources to run tasks in containers.
  • Container: An abstract notion which incorporates elements such as memory, CPU, disk, and network.
  • HDFS: Distributed file system in YARN cluster.

2.2 Voidbox Architecture Design

In the Voidbox architecture, YARN is responsible for the cluster's resource management. Docker acts as the task execution engine on top of the operating system, cooperating with the Docker Registry. Voidbox translates user code into Docker container-based DAG tasks, requests resources according to the tasks' requirements, and drives the DAG through execution.


Figure 2. Voidbox Architecture

As shown in figure 2, each box stands for one machine with several modules running inside. To describe the architecture more clearly, we divide the figure into three parts; the functionality of the Voidbox modules and Docker modules is as follows:

  • Voidbox Modules:
    • Voidbox Client: The client program. Through the Voidbox Client, users can submit a Voidbox application, stop it, and so on. Note that a Voidbox application contains several Docker jobs, and a Docker job contains one or more Docker tasks.
    • Voidbox Master: The application master in YARN; it takes care of requesting resources from YARN and then allocates them to Docker tasks.
    • Voidbox Driver: Responsible for task scheduling within a single Voidbox application. Voidbox supports Docker container-based DAG task scheduling, and user code can be inserted between tasks, so the Voidbox Driver handles the ordering imposed by the DAG's task dependencies and executes the user's code.
    • Voidbox Proxy: The bridge between YARN and the Docker engine, responsible for relaying commands from YARN to the Docker engine, such as starting or killing Docker containers.
    • State Server: Maintains health-status information for each Docker engine and provides the list of machines that can run Docker containers, so Voidbox Master can request resources more efficiently.
  • Docker Modules:
    • Docker Registry: Docker image storage, acting as an internal version control tool for Docker images.
    • Docker Engine: The Docker container execution engine; it pulls the specified Docker image from the Docker Registry and launches Docker containers.
    • Jenkins: Cooperates with GitLab: when application code is updated, Jenkins takes care of automated testing, packaging, generating the Docker image, and uploading it to the Docker Registry, completing the automated release process.

2.3 Running Mode

Voidbox provides two application running modes: yarn-cluster mode and yarn-client mode.

In yarn-cluster mode, the control component and resource management component run in the YARN cluster. After we submit the Voidbox application, the Voidbox Client can quit at any time without affecting the running application. This mode is intended for the production environment.

In yarn-client mode, the control component runs in the Voidbox Client, and the other components run in the cluster. Users can see much more detailed logs about the application's status. When the Voidbox Client quits, the application in the cluster exits too, which makes this mode more convenient for debugging.

Here we briefly introduce the implementation architecture of the two modes:

  • yarn-cluster mode


Figure 3. yarn-cluster mode

As shown in figure 3, Voidbox Master and Voidbox Driver are both running in the cluster. Voidbox Driver is responsible for controlling the logic and Voidbox Master takes care of application resource management.

  • yarn-client mode


Figure 4. yarn-client mode

As shown in figure 4, Voidbox Master is running in the cluster, and Voidbox Driver is running in Voidbox Client. Users can submit Voidbox application in IDE for debugging.

2.4 Running Procedure

Here are the procedures of submitting a Voidbox application and its lifecycle:

  1. Users write a Voidbox application with the Voidbox SDK and generate a Java archive, then submit it to the YARN cluster through the Voidbox Client.
  2. After receiving the Voidbox application, the Resource Manager allocates resources for Voidbox Master and launches it.
  3. Voidbox Master starts Voidbox Driver, which decomposes the Voidbox application into several Docker jobs (a job contains one or more Docker tasks). Voidbox Driver calls the Voidbox Master interface to launch the Docker tasks on compute nodes.
  4. Voidbox Master requests resources from the Resource Manager, which allocates YARN containers according to the cluster's status. Voidbox Master launches Voidbox Proxy in each YARN container, and the proxy communicates with the Docker engine to start the Docker container.
  5. The user's Docker task runs in the Docker container, with its log output going to a local file. Users can see real-time application logs through the YARN web portal.
  6. After all Docker tasks are done, the logs are aggregated to HDFS, so users can still get the application logs from the history server.

2.5 Docker integrating with YARN in resource management

YARN acts as the uniform resource manager of the cluster and is responsible for resource management on all machines. Docker, as a container engine, also has its own resource management, so integrating the two resource management functions is particularly important.

In YARN, user tasks can only run inside YARN containers, while Docker containers can only be handled by the Docker engine. Containers started directly by the Docker engine would escape YARN's management, breaking YARN's unified management and scheduling principle and creating a risk of resource leaks. In order to enable YARN to manage and schedule Docker containers, we need a proxy layer between YARN and the Docker engine. This is why Voidbox Proxy was introduced. Through Voidbox Proxy, YARN can manage the container lifecycle, including start, stop, etc.

To understand Voidbox Proxy more clearly, take stopping a Voidbox application as an example. When a user kills a Voidbox application, YARN recycles all the resources of the application. At this point, YARN sends a kill signal to the related machines. The corresponding Voidbox Proxy catches the kill signal and stops the Docker container in the Docker engine to complete the resource recycling. So with the help of Voidbox Proxy, killing an application stops not only the YARN containers but also the Docker containers, avoiding the resource leak issue that exists in the open source version (see YARN-1964).
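A stripped-down illustration of that idea in Python (not the actual Voidbox Proxy, which runs inside a YARN container; the container name is made up): the proxy launches the task's container, and if it receives a kill signal, it stops the Docker container before exiting so nothing leaks.

import signal
import subprocess
import sys

CONTAINER_NAME = "voidbox-task-example"   # illustrative name

def on_kill(signum, frame):
    # YARN is tearing the container down; take the Docker container with us
    subprocess.call(["docker", "stop", CONTAINER_NAME])
    sys.exit(0)

signal.signal(signal.SIGTERM, on_kill)

# run the actual task inside Docker; --rm removes the container once it stops
subprocess.call(["docker", "run", "--rm", "--name", CONTAINER_NAME,
                 "centos", "sh", "-c", "echo Hello Voidbox && sleep 3600"])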

3. Fault Tolerance

Although Docker has stable releases, an enterprise production environment runs a variety of operating system and kernel versions, which introduces instability. Voidbox's fault-tolerance design therefore covers multiple levels to ensure high availability.

  • Voidbox Master fault tolerance
    • If the Resource Manager finds that Voidbox Master has crashed, it notifies the NodeManagers to recycle all YARN containers belonging to this Voidbox application, then restarts Voidbox Master.
  • Voidbox Proxy fault tolerance
    • If Voidbox Master finds that a Voidbox Proxy has crashed, it recycles the Docker containers on behalf of that Voidbox Proxy.
  • Docker container fault tolerance
    • Each Voidbox application can configure a maximum number of retries on failure; when a Docker container crashes, Voidbox Master reacts according to the container's exit code.

4. Programming model

4.1 DAG Programming model

Voidbox provides a Docker container-based DAG programming model. A sample would look similar to this:


Figure 5. Docker container-based DAG programming model

As shown in figure 5, there are four jobs in this Voidbox application, and each job can configure its requirements for CPU, memory, Docker image, parallelism, and so on. Job3 starts when job1 and job2 have both completed. Job1, job2, and job3 form a stage, so users can insert their own code after this stage is done, before job4 finally starts running. A minimal illustration of this execution model follows.
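The snippet below is only a local illustration of that execution model in plain Python (the real Voidbox SDK is Java-based and runs the containers on YARN, not on the local Docker engine): jobs declare their dependencies, each stage of ready jobs runs in parallel as Docker containers, and user code can be inserted between stages.

import subprocess
from graphlib import TopologicalSorter   # Python 3.9+

# each job: a Docker image, a command, and the jobs it depends on (mirrors figure 5)
jobs = {
    "job1": {"image": "centos", "cmd": "echo job1", "deps": []},
    "job2": {"image": "centos", "cmd": "echo job2", "deps": []},
    "job3": {"image": "centos", "cmd": "echo job3", "deps": ["job1", "job2"]},
    "job4": {"image": "centos", "cmd": "echo job4", "deps": ["job3"]},
}

ts = TopologicalSorter({name: set(spec["deps"]) for name, spec in jobs.items()})
ts.prepare()
while ts.is_active():
    stage = list(ts.get_ready())              # jobs whose dependencies are all satisfied
    procs = {name: subprocess.Popen(["docker", "run", "--rm", jobs[name]["image"],
                                     "sh", "-c", jobs[name]["cmd"]])
             for name in stage}               # launch the whole stage in parallel
    for name, proc in procs.items():
        if proc.wait() != 0:
            raise RuntimeError(name + " failed")
        ts.done(name)
    # user code could run here, after one stage completes and before the next starts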

4.2 Shell mode to submit one task

In most cases, we want to run a single Docker container-based task without any programming, so Voidbox supports a shell mode for describing and submitting a Docker container-based task; under the hood it is an implementation based on the DAG programming model. Example usage of Voidbox in shell mode:

docker-submit.sh \
  -docker_image centos \
  -shell_command "echo Hello Voidbox" \
  -container_memory 1000 \
  -cpu_shares 2

The shell script above submits a task that runs "echo Hello Voidbox" in a Docker image named 'centos', with a resource requirement of 1000 MB of memory and 2 virtual CPU cores.

5. Voidbox in Action

At present we can run Docker, MapReduce, Spark, and other applications in the same YARN cluster. There are already lots of short-running tasks using Voidbox within Hulu.

  • Automated testing process
    • Cooperating with Jenkins, GitLab, and a private Docker registry: when application code is updated, Jenkins runs the automated tests, packages the program, regenerates the Docker image, and pushes it to the private Docker Registry. It's a complete process of development, testing, and automated release.
  • Complex tasks in parallel
    • A test framework is used to run tests that check the availability of certain components. The project is implemented in Ruby/Java and has complex dependencies, so we maintain two layers of Docker images: the first layer is the system software as a base image, and the second layer is the business level. We publish a test framework Docker image and use scheduling software to start the Voidbox application regularly. Thanks to Voidbox, we solved issues such as the complex dependencies and multitask parallelism.
    • Face Match (http://tech.hulu.com/blog/2014/05/03/face-match-system-overview/) is a video analysis application. It is implemented in C and depends on lots of graphics libraries. It benefits from Voidbox as well: we package the whole Face Match program into a Docker image, then write a Voidbox application to process multiple videos. Voidbox solves both the complex machine environment and the parallelism control problem.
  • Building complex workflows
    • Some tasks depend on each other; for example, user behaviors need to be loaded first and then analyzed, so the two steps have a sequential dependency. We use the Voidbox container-based programming model to handle this case easily.

6. Different from DockerContainerExecutor in YARN 2.6.0

  • DockerContainerExecutor (https://issues.apache.org/jira/browse/YARN-1964) was released in YARN 2.6.0 as an alpha version. It is not mature enough yet, and it is only an encapsulation layer above the default executor.
  • DockerContainerExecutor is difficult to run alongside other ContainerExecutors in one YARN cluster.
  • Voidbox features
    • DAG programming model
    • Configurable container level of fault tolerance
    • A variety of running modes, considering development environment and production environment
    • Share YARN cluster resources with other Hadoop job
    • Graphical log view tool

7. Future work

  • Support more versions of YARN
    • Voidbox would like to support more versions in the future besides YARN 2.6.0.
  • Voidbox Master fault tolerance, persistent metadata to reduce the cost in case of retry
    • Currently, if Voidbox Master crashes, YARN recycles the resources belonging to the Voidbox application and restarts Voidbox Master, which re-runs tasks from the very beginning. Tasks that have already finished or are still running should not need to be impacted. We might keep some metadata in the State Server to reduce this cost when Voidbox Master fails.
  • Voidbox Master as a permanent service
    • Voidbox will support a long-running Voidbox Master that can receive streaming tasks.
  • Support for long-running services
    • Voidbox will support long-running services, provided that Voidbox Master downtime does not affect running tasks.

You Can Now Use the Apple Watch as a Hulu Remote

July 15th, 2015 by admin

Today, we are excited to announce that we’ve created a new Hulu application for the Apple Watch that brings some of the most important features of a remote control directly to your wrist and allows you to control your viewing experience with a few simple taps.

At Hulu, we are constantly trying to find new and innovative ways to make the viewing experience as seamless as possible for our viewers. The Hulu application on the Apple Watch is the perfect opportunity to explore the Apple Watch OS and experiment with ways to integrate the Hulu experience into the popularity of wearable platforms.

With the Hulu app, you will be able to play, pause and rewind your favorite shows on Apple TV, Chromecast, Xbox One, PS3 and PS4 with a tap on your wrist. You will also be able to toggle captions from the Hulu app for Apple Watch.

You will be able to connect directly to an existing Chromecast or Xbox ONE, PS3 or PS4 device that’s streaming Hulu, and control it right when you launch your Hulu app on the Apple Watch.

If you watch on Apple TV, you will have to first launch a Hulu stream via Apple TV on your phone, and then you will be able to use your Apple Watch Hulu application as a remote.

Hulu for Apple Watch is now available in the Apple Watch app store. Stay tuned for more updates and feature additions.

The Hulu app for Apple Watch was implemented by mobile team intern Rahin Jegarajaratnam, who was mentored by Bradley Snyder along with the iOS dev team. Our intern program is unique in that we actively use it as an opportunity for interns to work on projects that directly touch consumers.

Aggregation of Relevance Tables with Expert Labeling

May 26th, 2015 by Heng Su

(by Wenkui Ding and Heng Su)

Introduction

In Hulu we continuously seek ways to improve our users’ content discovery experiences by various recommendation techniques. One of the most important components supporting the content discovery products is the relevance table.

The relevance table can be simply regarded as a 2-dimensional real-valued matrix showing how "relevant" or "similar" every two pieces of content are. Ultimately we need one single relevance table to generate recommendation results such as autoplay videos or the top 10 recommended shows for our users. The problem is that internally, instead of one single relevance table, we get many sub-relevance tables from different sources; for example, we have relevance tables generated from our users' watch behaviors, from search behaviors, and from content metadata in Hulu, respectively. We will introduce how we generate the final relevance table in production by aggregating these sub-relevance tables with domain expert labeling information.

Without loss of generality, to simplify the problem, let’s assume all relevance tables are in the grain size of TV shows, i.e., each relevance table represents the relevance between TV shows.

The Workflow

The simplest and perhaps most intuitive way to aggregate the sub-relevance tables is to manually evaluate the quality of each table, assign it a weight, and take the weighted linear combination of the tables. However, this is not good enough: first, the quality of the relevance tables changes as they update; second, a single global weighting cannot capture all of the useful information in the tables. For example, some accurate relevance information would be neglected simply because the overall quality of the table containing it is low. So we use a more sophisticated aggregation algorithm.

The entire workflow is shown below. The following sections describe the details of the process.

Fig 1. Relevance Table Aggregation Workflow

The “Libra” Front-end

First, we have a front-end named "Libra" to collect domain experts' labeling results. For each labeling task, the expert is presented with three shows (referred to as a "show tuple", denoted A, B, and C) and must answer the question: "Which of shows B and C is more relevant to show A?" The answer can be B or C. When neither B nor C is relevant, or when it's hard to make the judgment, the expert can also select "skip this tuple".

Note the same show tuple could be labeled by different experts or even the same expert at different time to test if consensus could be made.

Instead of letting the expert directly assign the relevance value between two shows, we prefer the above way because it’s much easier for the expert to compare two relevance values than decide one relevance value.

The Machine-learning-based Relevance Aggregation

Second, a machine-learning-based algorithm based on the label results (as ground truth) and the sub-relevance tables (as features) is introduced to generate the final combined relevance table for our various products.

The objective function is a smooth pairwise ranking loss over the labeled tuples, of the form

L = sum over (k,i,j) of l( ( f(x_{k,i}) - f(x_{k,j}) ) / sigma ),

where (k,i,j) is a labeling result indicating that the expert prefers show i to show j for source show k; x_{k,i} is the vector containing the relevance values from show k to show i in all sub-relevance tables; f is a scoring function that produces the final relevance value; l is a decreasing smooth surrogate loss (for example the logistic loss l(z) = log(1 + e^(-z))); and sigma is a parameter controlling the "smoothness" of the objective. The goal is to minimize the objective L and obtain the optimal scoring function f.

We propose two ways to model f: a linear combination and a non-linear combination.

(1) Linear Combination

f is modeled as a linear function of the sub-table relevance values,

f(x) = w^T x,

where w is the weight vector. The optimal w can be found by stochastic gradient descent (SGD), as the following algorithm shows.

Algorithm I


a. For each round of iteration, enumerate all labeling results, and for each label result (k,i,j) do the following:

  (i). Calculate the gradient of the loss function w.r.t. the weight vector.

  (ii). Update the weight vector by taking a step against that gradient.

b. Repeat the iteration until the weight vector converges or the total loss falls below a threshold.
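
To make Algorithm I concrete, here is a minimal Python sketch, assuming the sigmoid-smoothed pairwise loss sketched above and a hypothetical features dictionary that maps a (source show, candidate show) pair to its vector of sub-relevance-table values; it is an illustration, not Hulu's production code.

    import numpy as np

    def train_linear_weights(labels, features, dim, sigma=1.0, lr=0.01, epochs=50, tol=1e-4):
        """Pairwise SGD for the linear scoring function f(x) = w . x.

        labels   -- list of (k, i, j): the expert prefers show i over show j for source show k
        features -- dict mapping (k, i) to a numpy vector of sub-relevance-table values
        dim      -- number of sub relevance tables (length of each feature vector)
        """
        w = np.zeros(dim)
        for _ in range(epochs):
            prev = w.copy()
            for (k, i, j) in labels:
                x_pos, x_neg = features[(k, i)], features[(k, j)]
                margin = w.dot(x_pos) - w.dot(x_neg)
                s = 1.0 / (1.0 + np.exp(sigma * margin))
                # gradient of the sigmoid-smoothed pairwise loss w.r.t. w
                grad = -sigma * s * (1.0 - s) * (x_pos - x_neg)
                w -= lr * grad
            if np.linalg.norm(w - prev) < tol:  # stop once the weight vector converges
                break
        return w

The learned w is then applied to every show pair's feature vector to produce the final aggregated relevance table.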


 

(2) Non-linear (Additive) Combination

In the non-linear case, f is modeled as a boosted regression tree:

where each weak learner is a one-level tree, i.e., a binary function that outputs only 0 or 1, generated by thresholding one of the sub relevance tables, and each tree has a corresponding multiplier.

The gradient boosting method is used to obtain a (sub-)optimal f, as the following algorithm describes.

Algorithm II


a. Initialize the model f.

b. In each round of iteration, do the following:

  (i). Enumerate all labeled show pairs and, for each pair, compute the pseudo-residuals for each item.

  (ii). Fit a one-level tree to the pseudo-residuals.

  (iii). Compute the corresponding multiplier.

  (iv). Update the model by adding the new tree scaled by its multiplier.

c. Repeat the above process until the total loss is below a threshold.
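
Algorithm II can be sketched in Python as follows, again under the assumption of the sigmoid-smoothed pairwise loss; fit_stump and boost_pairwise are illustrative names, the line search is simplified to a shrinkage step, and none of this is Hulu's production code.

    import numpy as np

    def fit_stump(X, residuals, n_thresholds=16):
        """Fit a one-level tree h(x) = 1[x[d] > t] to the pseudo-residuals by least squares."""
        best = (None, 0.0, 0.0, np.inf)              # (feature, threshold, value, error)
        for d in range(X.shape[1]):
            for t in np.quantile(X[:, d], np.linspace(0.05, 0.95, n_thresholds)):
                mask = X[:, d] > t
                if not mask.any():
                    continue
                v = residuals[mask].mean()           # best constant output on the "1" side
                err = np.sum((residuals - v * mask) ** 2)
                if err < best[3]:
                    best = (d, t, v, err)
        return best[:3]

    def boost_pairwise(labels, features, sigma=1.0, rounds=100, shrinkage=0.1):
        """Gradient boosting of one-level trees for the pairwise loss (sketch of Algorithm II).

        labels   -- list of (k, i, j): the expert prefers show i over show j for source show k
        features -- dict mapping (k, i) to a vector of sub-relevance-table values
        """
        keys = sorted({(k, i) for (k, i, j) in labels} | {(k, j) for (k, i, j) in labels})
        index = {key: n for n, key in enumerate(keys)}
        X = np.array([features[key] for key in keys])
        F = np.zeros(len(keys))                      # step a: initialize the model to zero
        model = []
        for _ in range(rounds):
            residuals = np.zeros(len(keys))
            for (k, i, j) in labels:                 # step b(i): pseudo-residuals per item
                p, q = index[(k, i)], index[(k, j)]
                s = 1.0 / (1.0 + np.exp(sigma * (F[p] - F[q])))
                g = sigma * s * (1.0 - s)            # negative gradient of this pair's loss
                residuals[p] += g
                residuals[q] -= g
            d, t, v = fit_stump(X, residuals)        # step b(ii): fit a one-level tree
            if d is None:
                break
            gamma = shrinkage * v                    # step b(iii): multiplier (simplified line search)
            F += gamma * (X[:, d] > t)               # step b(iv): update the model
            model.append((d, t, gamma))
        return model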


 

Results

Significant improvements have been observed in experiments using the new aggregated relevance table, compared with the manually weighted linear combination method. We have found:

  1. A relative improvement of around 4.51% in eCTR (effective click-through rate) on the "YouMayAlsoLike" tray, and
  2. An improvement of around 5.69% in watch minutes from the part of autoplay that is driven by the relevance table.

Conclusion

The relevance table is an important component of Hulu's recommendation system. Online experiments show significant improvements of the machine-learning-based relevance table aggregation over the fixed weighted combination, especially with the non-linear combination method. There are still other questions to answer in our recommendation engine, such as how to ensure diversity, how to adjust the relevance table based on users' explicit feedback, and how to utilize context information.

 

Face Match System – Clustering, Recognition and Summary

May 4th, 2014 by Cailiang Liu

Following the workflow of the Face Match system, this blog entry introduces the third core technique: face track clustering and recognition.

Track tagging

When a series of face tracks has been extracted from a set of videos, the next step is to tag them automatically with probable actor names from the given show. After all, manually processing all the tracks from scratch would be infeasible. The tags, at some acceptable accuracy rate — let's say 80 percent — provide valuable cues for a person to verify the tracks in groups. When presented in a user-friendly interface, the tags also reduce the time needed to correct erroneous matches. Given that, we are seeking ways to improve tagging accuracy for face tracks. This naturally falls into the machine-learning framework, which is widely adopted in the computer vision research community. In this blog entry, we refer to the problem of automatically annotating faces (not tracks) as face tagging.

Traditionally, face verification technology tries to identify whether a given image belongs to a specific person from a set of candidates. Though successfully applied in controlled environments, the approach has strict assumptions: the video must have good lighting, the actors must be facing the camera, and their faces cannot be obscured. These assumptions do not hold in challenging environments such as TV shows or movies. General research interest has recently turned toward uncontrolled datasets such as "Labeled Faces in the Wild" (LFW), and the benchmark of identifying whether two faces belong to the same person has attracted a lot of attention. However, the LFW database contains many people with only two face samples each, so the benchmark hardly covers the case of identifying many people across many poses in truly unconstrained environments.

In the machine-learning framework, the problem of track tagging essentially boils down to constructing a proper track similarity function from the similarities of the faces in the tracks. Because we are dealing with the largest dataset for face verification in research history, the time and labor for human verification have become the most critical metrics, and only by improving the accuracy of track tagging can we significantly reduce them. A few aspects impact the results: 1) the feature set; 2) the learning approach; 3) the cold-start problem. At the same time, because of the very large dataset, we are constrained by the amount of processing time we can afford. Given the potential number of videos available to Hulu, we need to keep processing time under one second per face image, so we cannot afford recent effective yet heavy methods such as those based on dense local features. Next, we will elaborate on each of these aspects.

Feature extraction

In the current system, we leverage several kinds of visual information to improve tagging accuracy. Compared with a single image, we also have the temporal information provided by continuous face tracks. Fusing these tracks into a 3-D face model is an interesting alternative to explore in the future. For now, we select a few representative faces and construct the track similarity function as a function of the similarity of these representative faces.

First, we resize the image so that the face region is 80×80 pixels. Then we enlarge the selected region to 80×160 pixels by extending it 40 pixels up and 40 pixels down. See Figure 1 for an example.

Standard face features are extracted on the global region and on local regions: a global face feature and LBP (local binary pattern) facial features, respectively. The global face feature is extracted on the aligned face with a 16×16 grid layout, with each grid cell containing a 58-dim LBP histogram. The LBP facial features are extracted on each local facial window with a 4×4 grid layout, accumulating a 58-dim histogram of LBP codes for each cell; these local histograms are concatenated into a 928-dim vector (4 × 4 × 58 = 928).
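
As a rough illustration of the 4×4-grid facial feature described above, the sketch below concatenates per-cell 58-dim LBP histograms into a 928-dim vector; the histogram function is passed in, since the exact LBP variant and cell layout used in production are not spelled out in this post.

    import numpy as np

    def grid_lbp_feature(region, lbp_histogram, grid=(4, 4)):
        """Concatenate per-cell LBP histograms over a grid layout (4 x 4 cells x 58 bins = 928 dims).

        lbp_histogram -- any callable that maps an image patch to a 58-bin LBP histogram.
        """
        rows, cols = grid
        cell_h, cell_w = region.shape[0] // rows, region.shape[1] // cols
        cells = [lbp_histogram(region[r * cell_h:(r + 1) * cell_h, c * cell_w:(c + 1) * cell_w])
                 for r in range(rows) for c in range(cols)]
        return np.concatenate(cells)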

A few face verification approaches require face alignment and face warping as a preprocessing step. The alignment process identifies landmark points on the face, e.g., the corners of the eyes, the mouth and the nose. The face can then be warped to a frontal position by triangulating the facial landmarks and finding the affine mapping, so global face features can also be extracted on the warped faces. However, in our experiments we did not see much improvement from this step, which may be due to the fragility of the alignment algorithm we used.

We assume that the given character’s appearance will not change often in one video. So we further incorporate a few other features to reflect the character’s appearance, including hair and face, as well as the environment in which he or she appears. More specifically, we extract texture and color features in respective areas of the face image to reflect hair and scenery. The LBP feature is also extracted on the full 80×160 region to represent the face as a whole. The importance weights among different modalities are learned afterward with some label information for face tracks.


Figure 1. Feature extraction for face tracks

Learning approach

The primary goal in this step is to construct a proper track similarity function as a function of the similarity of the underlying faces across the tracks.

Given a new video, the tracks of an actor will usually be more similar to tracks of the same actor within that video than to tracks of the same actor from other videos, because the actor's appearance remains mostly the same within one video. Thus label information from the current video is more valuable than label information from other videos. With these labels we can expect higher tagging accuracy, so we adopt an online learning scheme that incorporates newly verified track labels from the current video as early as possible.

As we need to handle several tens of thousands of actors in our system, building and maintaining a supervised model for every possible actor is infeasible, even though only 100 to 500 actors matter for a given show. Given the online learning requirement and the huge number of candidates, we adopt a k-nearest-neighbor (kNN) based lazy-learning approach to annotate the faces independently, and then vote among the face tags to determine the tag for the track. The merit of such lazy learning is that we do not need to maintain any learned model, and newly acquired labels can be added instantly. As shown in Figure 2, after feature extraction an approximate kNN scheme is used to speed up the neighbor-finding process. For a face track X, the j-th feature of the i-th face in X is denoted Xij, and its nearest samples are denoted S1, S2, and so on, where S1 is the most similar neighbor and S2 the second. Each face is represented as a linear combination of its nearest neighbors, and the weight for each neighbor is taken as the similarity of the target face to that neighbor. The L2-norm is used in the similarity metric because the L1-norm gives worse performance and is far less efficient:

fmq1

We treat faces with different poses in the same way, since the database is large enough that faces will find neighbor faces with the same pose. With the coefficients bij, we can generate a voting distribution over the identity list aij:

fmq2

To measure the reliability of the voting, we use the sparse concentration index as the confidence score:

fmq3

To fuse the per-feature votes and label the samples Xij, we use the formula fmq9. We define a weighting function fmq5, where the term c2 magnifies votes with large confidence scores and vjk are parameters that need to be learned. In other words, when the confidence score is not high, the vote weight is lowered.
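
Since the exact formulas above appear only as images, the following Python sketch shows one plausible reading of the voting step: per-face identity votes from the kNN reconstruction coefficients, a sparse-concentration-style confidence score, and confidence-weighted fusion with a squared confidence term. The learned per-class weights vjk are omitted, and all names here are illustrative rather than our production code.

    import numpy as np

    def identity_votes(coeffs, neighbor_ids, n_identities):
        """Turn kNN reconstruction coefficients into a voting distribution over identities."""
        votes = np.zeros(n_identities)
        for b, ident in zip(coeffs, neighbor_ids):
            votes[ident] += max(b, 0.0)        # accumulate (non-negative) coefficient mass per identity
        total = votes.sum()
        return votes / total if total > 0 else votes

    def sparse_concentration(votes):
        """Confidence of a vote distribution: ~1 when the mass sits on one identity, ~0 when uniform."""
        k = len(votes)
        if k < 2 or votes.sum() == 0:
            return 0.0
        return (k * votes.max() / votes.sum() - 1.0) / (k - 1.0)

    def fuse_faces(face_votes):
        """Fuse per-face vote distributions, down-weighting low-confidence faces."""
        fused = np.zeros_like(face_votes[0])
        for v in face_votes:
            c = sparse_concentration(v)
            fused += (c ** 2) * v              # c**2 magnifies votes with large confidence scores
        return int(np.argmax(fused)), fused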

Learning voting weights for features with structured output SVM

The standard structured output SVM primal formulation is given as follows:

fmq6
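
For reference, the standard margin-rescaling primal that this refers to is usually written as:

    \min_{w,\xi} \; \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i
    \quad \text{s.t.} \quad \langle w, \Psi(x_i, y_i)\rangle - \langle w, \Psi(x_i, y)\rangle \ge \Delta(y_i, y) - \xi_i, \;\; \forall y \ne y_i, \qquad \xi_i \ge 0

where Ψ is the joint feature map, Δ is the label loss, and the slack variables ξ_i absorb margin violations.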

The voting weight w is the stack of the vectors vj. To learn w, we define fmq18, where fmq19 is an indicator vector whose y-th entry is 1 (it selects the features for class y), and fmq20 maps a track to a matrix of confidences for the different identities:

fmq7

Learning a structured output SVM with the kernel fmq15 defined above results in weight vectors that best combine the multi-view features for face track recognition. To vote for the identity label of a track X, we use the following formula:

fmq8

Fusing samples for the track label

One simple way to fuse the different samples Xi is to use all of their identity distributions fmq16 when computing fmq15. However, this can cause mismatches: many samples are very similar, and they may all match faces with the wrong identity. To avoid such mismatches, we adopt the GRASSHOPPER diversity-sampling algorithm to select diverse samples, and define the following similarity function for GRASSHOPPER:

fmq10

where fmq11 is the most similar neighbor of fmq12.

Finally, the label of the face track X is computed using the formula:

fmq13

Experiments show that, with a sufficiently large face database, the precision of automatic track tagging is as high as 95 percent when annotating 80 percent of the face tracks. For some high-quality episodes, the system is able to annotate 90 percent of face tracks with 95 percent accuracy. This significantly reduces the time required for manual confirmation.

After automatic tagging, the face tracks are clustered with respect to visual similarity and presented to human annotators for verification. The corrected labels are fed back into the system to further improve the tagging accuracy.

Cold start

The cold-start phenomenon is frequently discussed in research on recommendation systems: for a newcomer to the system there is no information, and thus no cue for deciding which items to recommend. Similarly, when a new show or a new actor enters our system, we have no labeled information, so supervised learning is not feasible. In this situation we resort to unsupervised or semi-supervised learning to provide the system with initial labels for a few tracks.

Simple unsupervised hierarchical clustering is possible, but we can do better. Although we have no label information for a new show or a new actor, we do have labels for other actors in other shows. Thus, with pre-built classifiers for each of the known actors, we construct a similarity vector that measures the similarity of the current track to the given set of known actors. See Figure 2 for details, where the small graph shows an example of one track's classification scores against a list of known actors. Arguably, this similarity vector encodes some prior knowledge in the system, so we expect the semi-supervised scheme to outperform the unsupervised one. Experimental results show that the semi-supervised scheme improves cluster purity by 30 percent over the unsupervised scheme.


Figure 2. Computing track similarities (with respect to known actors) for face track clustering

Lessons learned

  • Combining face features and context features for hair and clothes improves annotation accuracy.
  • The online active learning scheme shows better results than offline ones.
  • Confirmation is an easier and faster task than annotation for humans. More accurate prediction results help a lot in reducing confirmation time.
  • Grouping visually similar tracks together for confirmation lightens manual workload and significantly reduces human reaction time.
  • The semi-supervised scheme helps solve the cold start problem, and therefore helps annotation.

Our exploration is a preliminary investigation of the track-tagging problem. This is an interesting open research problem and we will continue to improve the annotation accuracy.

This is the 4th blog of the Face Match tech blog series. You can browse the other 3 blogs in this series by visiting:

1. Face Match System – Overview

2. Face Match System – Face Detection

3. Face Match System – Shot Boundary Detection and Face Tracking


Face Match System – Shot Boundary Detection and Face Tracking

May 3rd, 2014 by Tao Xiong

Following the workflow of the Face Match system, this blog entry introduces the second core technique: face tracking with shot boundary detection.

Shot boundary detection

What is shot boundary detection?

A video is usually composed of hundreds of shots strung into a single file. A shot is a run of continuous frames captured in one camera action. Shot boundary detection locates the exact boundary between two adjacent shots. There are several kinds of boundaries between adjacent shots, but they generally fall into two types: abrupt transitions (CUT) and gradual transitions (GT). A CUT is usually easy to detect, since the change at the boundary is large. Based on the editing effect, a GT can be further divided into dissolve, wipe, fade out/in (FOI), and so forth. In a GT there is a smooth transition from one shot to another, which makes it more difficult to determine the position of the boundary. It can also be difficult to tell a GT apart from fast movement within a single shot, since the content varies smoothly in both cases.

Why is shot boundary detection needed?

Shot boundary detection is widely useful in video processing. It is a preliminary technique that can help us to divide a long and complex video into relatively short and simple segments.

In Face Match, shots are the basic units for face tracking: they provide an effective constraint that keeps a face track from drifting across multiple shots.

How do you achieve shot boundary detection?

Three steps are required for shot boundary detection:

1. Extract features to represent the video content.

To find shot boundaries, the video is analyzed frame by frame. The color vector composed of the color values of all pixels in a frame is not good enough for detecting a shot change, since it is very sensitive to movement and illumination. Therefore, histogram features are extracted for both color, in the HSV color space, and texture, with the local binary pattern (LBP) descriptor. LBP reflects local geometric structure and is less sensitive to variations in global illumination.

2. Compute the measurement of continuity.

Continuity measures the similarity between adjacent frames; on a shot boundary, the continuity should have a low value. Using this measurement, the content of a video can be transformed into a one-dimensional temporal signal. If the measurement is associated with only two adjacent frames, GTs are hard to detect, since the variation between two adjacent frames is small. Thus a larger time window is used: K frames lie along the time axis within the window (see Figure 1 below), and their all-pair similarities are computed. These K frames form a graph with K*(K-1) edges weighted by the similarities, as demonstrated below. We adopt histogram intersection as the similarity measure, weighted by the distance between the two frames in a pair.


Figure 1. The graph with K*(K-1) edges (only part of edges are shown) and the K*K weight matrix

The normalized cut CN of this graph is calculated as the continuity value of the middle frame in this window where

fm_equ3_1

Since color and LBP histograms are both employed, two curves can be obtained. The results are combined by multiplying them together.

3. Decide the position (and type) of the shot boundary.

There are two approaches to determining the shot boundary. The first uses a pre-defined threshold to classify the curve into two categories; the second relies on machine-learning techniques to train a classifier. Since we lack enough training data, we chose the first approach.
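
The three steps above can be sketched in Python roughly as follows, assuming per-frame HSV color and LBP histograms have already been computed and L1-normalised. The exact edge weighting, window size and threshold used in production appear only as images in the original post, so this is a simplified illustration.

    import numpy as np

    def hist_intersection(h1, h2):
        """Similarity of two L1-normalised histograms."""
        return np.minimum(h1, h2).sum()

    def continuity_curve(hists, K=8):
        """Continuity value per frame from a K-frame window centred on it (step 2).

        A graph is built over the K frames with edges weighted by histogram intersection
        (down-weighted by temporal distance); the normalised cut between the frames before
        and after the middle frame is low exactly when the content changes.
        """
        n, half = len(hists), K // 2
        curve = np.ones(n)
        for m in range(half, n - half):
            frames = list(range(m - half, m + half))
            W = np.zeros((K, K))
            for a in range(K):
                for b in range(K):
                    if a != b:
                        W[a, b] = hist_intersection(hists[frames[a]], hists[frames[b]]) / abs(a - b)
            A, B, V = np.arange(half), np.arange(half, K), np.arange(K)
            cut = W[np.ix_(A, B)].sum()
            curve[m] = cut / (W[np.ix_(A, V)].sum() + 1e-8) + cut / (W[np.ix_(B, V)].sum() + 1e-8)
        return curve

    def shot_boundaries(hists_color, hists_lbp, threshold=0.2, K=8):
        """Steps 2 and 3: combine the colour and LBP continuity curves and threshold them."""
        c = continuity_curve(hists_color, K) * continuity_curve(hists_lbp, K)
        return [i for i in range(1, len(c) - 1)
                if c[i] < threshold and c[i] <= c[i - 1] and c[i] <= c[i + 1]]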

Face Tracking

What is face tracking?

Face tracking follows a human face through a video or a continuous image sequence, starting from an initial state (with parameters such as position, scale, rotation, expression, etc.) given by face detection and possibly face alignment techniques (Figure 2).

Face tracking may be implemented online or offline. In online mode, a face is tracked while the video is being captured. Thus, only current and previous frames can be used to exploit information for tracking and the efficiency requirement is strict. In offline mode, the whole video file is generated ahead of time. Therefore, the information of any frame can be used to guide the tracking.

In Face Match, since the video is available beforehand, we implement tracking in offline mode, and we are concerned only with the position and scale of the face.


Figure 2. Illustration of face tracking

Why is face tracking needed?

A video is generally composed of tens of thousands of frames. To find as many faces as possible, one option is to perform face detection frame by frame. Given that detection takes about 0.3 seconds per 640×360 frame, processing a video this way is more than eight times slower than video playback (0.3 s per frame at roughly 30 frames per second). Thus, it is not feasible in practice.

Considering the continuity of video along the time axis and the redundancy between adjacent frames, face tracking can be employed instead of running face detection on every frame. Since face tracking is very efficient, the time cost is significantly reduced. Moreover, faces of the same person in consecutive frames are linked together, so for each face track only a few representative face samples are needed in the subsequent face clustering or tagging steps, which dramatically decreases processing time. Face tracking can also help recover faces that are difficult to detect.

How do you achieve face tracking?

There are several mature standard models for object tracking, such as optical flow, mean shift and particle filters. Considering the efficiency needed to process thousands of videos, we adopted an optical-flow-based tracker following the Kanade–Lucas–Tomasi approach, which is based on object appearance and non-linear least-squares optimization. If the appearance of the object changes only slightly over time, tracking performance is very good. The approach can also handle motion parameters beyond translation and scale, such as 3D rotation angles and expression parameters (e.g., active appearance models). By adopting the inverse compositional technique, the optical-flow optimization is very efficient.

The optical-flow-based tracker exploits the continuity of adjacent frames under three assumptions:

  1. The appearance of the target object is similar or identical in adjacent frames
  2. The target object has abundant texture
  3. The variation of the pose parameters (translation, scaling, rotation) is small

For face tracking in a video stream, these three assumptions are usually satisfied.

Given a face box in the first frame, optical flow minimizes the appearance difference between face areas in adjacent frames to find the best face box in the next frame. In our application, the parameters describing a face box are translation and scale, and they are obtained iteratively by solving a non-linear least-squares problem. Some further considerations are:

  • To reduce sensitivity to illumination, we use the normalized intensity of gradients fm_equ3_3 as the appearance descriptor fm_equ3_4, since it is also simple to compute. The raw intensity of gradients is normalized by a sigmoid function to limit its dynamic range to [0, 1].

fm_equ3_2

  • To cover large displacements of the face both in and out of the image plane, a multi-resolution strategy with a pyramid structure is employed.
  • A two-step tracking strategy is used: 1) track only the translation of the face area using the pyramid structure; 2) track translation and scale simultaneously at a single resolution.
  • To keep the track from drifting into the background, an online learning model is adopted in the second step above: each pixel of the face-area appearance is modeled as a Gaussian distribution whose mean and variance are updated during tracking. If the tracking error exceeds a pre-defined threshold, the track is terminated.

The preprocessing consists of face detection and shot boundary detection: face detection provides starting points for face tracking, and shot boundary detection confines each face track to a single shot. Thus, before tracking, each shot contains several detected face boxes in different frames. We iteratively associate the detected faces into longer tracks and extend the connected tracks with further tracking, which completes the tracking step.
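
The appearance descriptor and the drift check from the considerations above can be sketched as follows; the actual inverse-compositional optimization and the pyramid search are omitted, and the parameter values are placeholders rather than the ones used in production.

    import numpy as np

    def gradient_descriptor(patch):
        """Sigmoid-normalised gradient magnitude of a grayscale face patch, limited to [0, 1)."""
        gy, gx = np.gradient(patch.astype(np.float64))
        mag = np.sqrt(gx ** 2 + gy ** 2)
        return 2.0 / (1.0 + np.exp(-mag)) - 1.0      # maps zero gradient to 0 and large gradients to ~1

    class OnlineAppearanceModel:
        """Per-pixel Gaussian model of the face appearance, updated while tracking."""

        def __init__(self, descriptor, init_var=0.01, rate=0.05):
            self.mean = descriptor.copy()
            self.var = np.full_like(descriptor, init_var)
            self.rate = rate

        def error(self, descriptor):
            """Variance-normalised squared error to the model; large values signal drift."""
            return float(np.mean((descriptor - self.mean) ** 2 / (self.var + 1e-6)))

        def update(self, descriptor):
            diff = descriptor - self.mean
            self.mean += self.rate * diff
            self.var = (1.0 - self.rate) * self.var + self.rate * diff ** 2

    # usage sketch: terminate the track once the appearance drifts beyond a threshold
    # model = OnlineAppearanceModel(gradient_descriptor(first_face_patch))
    # if model.error(gradient_descriptor(next_face_patch)) > DRIFT_THRESHOLD: stop the track
    # else: model.update(gradient_descriptor(next_face_patch))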

This is the 3rd blog of the Face Match tech blog series. You can browse the other 3 blogs in this series by visiting:

1. Face Match System – Overview

2. Face Match System – Face Detection

4. Face Match System – Clustering, Recognition and Summary

Face Match System – Face Detection

May 3rd, 2014 by Tao Xiong

Following the workflow of the Face Match system, this blog entry introduces the first core technique: face detection.

Face detection

How does the system identify which faces to detect?

Face detection is an essential step in face tagging, and the detection rate strongly correlates with the final recall of faces in the system. We take careful steps to detect frontal faces as well as profile faces, because the latter are indispensable for recalling whole-profile face tracks, which are abundant in premium videos. The detection of rotated faces is also a necessity. See Figure 1 below for an illustration of the face poses we strive to detect, where yaw refers to profile angles ranging from -90 to 90 degrees and rotation refers to in-plane rotation from -90 to 90 degrees. We do not cover the full range of in-plane rotation for efficiency reasons.


Figure 1. Out-of-plane and in-plane rotations of the human face

Incorporating such variation in the detector complicates the architecture design. We need to carefully design the algorithm and its parameters to balance accuracy, false detection rate and running speed. Keep in mind that the detector is the most time-consuming component of the whole system.

Building a multi-view face detector

Face detection is a well-studied problem with a long research tradition. The state-of-the-art detector follows the sliding-window approach, exhaustively scanning all possible sub-windows in an image, and relies on a cascade-boosting architecture to quickly filter out negative examples. See Figure 2 for an illustration of the cascaded classifiers. Each stage (denoted 1, 2, 3, etc.) is a classifier that scores the sub-windows; windows with scores below a certain threshold are discarded and only those with larger scores are passed on. With carefully designed classifiers, we can safely filter out a portion of the negative examples without falsely rejecting many true positives. Although the number of sub-windows in an image is huge, most of them are negative examples and run through only one or two stages, so the process is quite efficient for a single face pose.


Figure 2. Cascade classifiers
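
A cascade stage can be expressed in a few lines; the sketch below assumes each stage exposes a scoring function and a rejection threshold, which is an illustrative interface rather than our actual detector code.

    def pass_cascade(window, stages):
        """Run a sub-window through cascaded classifiers (Figure 2).

        stages -- sequence of (score_fn, threshold) pairs; score_fn maps a window to a real value.
        """
        for score_fn, threshold in stages:
            if score_fn(window) < threshold:
                return False       # rejected early: most negative windows exit after one or two stages
        return True                # survived every stage: keep as a face candidate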

However, running separate detectors for multiple poses in parallel ignores the structure of the face-pose space and is inefficient. To facilitate feature and detector sharing among different poses, various hierarchical detector structures have been proposed and implemented. We chose the pyramid structure for its simple and independent training process for the underlying component detectors. The pyramid structure is a coarse-to-fine partition of multi-view faces; see Figure 3 below for an illustration of the yaw-based partition process.


 Figure 3. Partition process of yaw angle

Our situation is a bit more complex, since we need to handle in-plane rotation and yaw rotation at the same time. Thus a branching node is needed to decide whether a given example goes to the in-plane-rotation branch or the yaw-rotation branch (Figure 4). More specifically, we train a five-stage all-pose face/non-face detector as the root node, and then train two ten-stage detectors for in-plane rotation and yaw rotation respectively. The outputs of these two detectors are compared to select the branch to follow. After that, the problem reduces to the already-solved problem of rotation in one dimension, be it in-plane or yaw. Within a branch, the same coarse-to-fine strategy is used. The final output includes both the face position and a pose estimate.


Figure 4. Whole face detector structure to handle multi-view faces
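
The branching logic of Figure 4 can be sketched as follows, again with an assumed detector interface (a score method, a threshold and a detect method) rather than the real implementation:

    def detect_multi_view(window, root, inplane_branch, yaw_branch):
        """Route a candidate window through the multi-view detector structure."""
        if root.score(window) < root.threshold:
            return None                            # rejected by the 5-stage all-pose root detector
        # compare the two 10-stage detectors and follow the stronger branch
        branch = inplane_branch if inplane_branch.score(window) >= yaw_branch.score(window) else yaw_branch
        return branch.detect(window)               # coarse-to-fine search within that rotation branch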

Haar wavelet features are typically used in face detection because they are simple and fast, but their dimensionality often reaches the tens of thousands. In contrast, the local binary pattern (LBP) feature is only a 58-bin sparse histogram; it captures the local geometric structure of the image and is less sensitive to global illumination variations. We have therefore adopted the LBP histogram for our system.
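
A 58-bin uniform LBP histogram can be computed along the following lines; details such as the sampling radius and the treatment of non-uniform codes are not given in the post, so this is a minimal numpy sketch rather than the exact feature we use.

    import numpy as np

    def _uniform_codes(P=8):
        """The 'uniform' 8-bit LBP codes (at most two 0/1 transitions in the circular pattern)."""
        codes = []
        for c in range(2 ** P):
            bits = [(c >> k) & 1 for k in range(P)]
            transitions = sum(bits[k] != bits[(k + 1) % P] for k in range(P))
            if transitions <= 2:
                codes.append(c)
        return {code: idx for idx, code in enumerate(codes)}   # 58 codes for P = 8

    def lbp_histogram(gray):
        """58-bin uniform LBP histogram of a grayscale image region."""
        bins = _uniform_codes()
        g = gray.astype(np.int32)
        center = g[1:-1, 1:-1]
        # the eight neighbours of each pixel, in a fixed circular order
        neighbours = [g[0:-2, 0:-2], g[0:-2, 1:-1], g[0:-2, 2:], g[1:-1, 2:],
                      g[2:, 2:], g[2:, 1:-1], g[2:, 0:-2], g[1:-1, 0:-2]]
        code = np.zeros_like(center)
        for k, n in enumerate(neighbours):
            code |= ((n >= center).astype(np.int32) << k)
        hist = np.zeros(len(bins))
        for c, idx in bins.items():
            hist[idx] = np.count_nonzero(code == c)
        total = hist.sum()
        return hist / total if total > 0 else hist             # non-uniform codes are simply ignored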

We’ve also integrated the boosting framework for training the classifier stages. We use a RankBoost like reweighting scheme in each round to balance the weights for positive and negative examples. This is useful to tune the classifiers to focus more on the limited positive examples. We also follow the nested cascade structure to further reduce the number of weak classifiers needed in the detector.

Synthetic examples (flipped and rotated versions of faces with small random translation, scale and rotation transformations) are created to enlarge the face dataset. In training, multi-threading speeds up the process.

Our multi-view face detector detects faces in about 300 ms per 640×360 image. Accuracy is about 80 percent for frontal faces and 60 percent for profile faces, both at a 5 percent false detection rate.

This is the 2nd blog of the Face Match tech blog series. You can browse the other 3 blogs in this series by visiting:

1. Face Match System – Overview

3. Face Match System – Shot Boundary Detection and Face Tracking

4. Face Match System – Clustering, Recognition and Summary


Face Match System – Overview

May 3rd, 2014 by Zhibing Wang

Motivation

We must confess: sometimes even we have a hard time recognizing actors in TV shows and movies. Sometimes the name is right on the tip of our tongues, but we still can't place it, and it's even harder with some foreign actors. If there were a way for a video to provide detailed metadata about an actor whenever he or she pops up on screen, Hulu users could have that information displayed right in the Hulu player, with the option to learn more about the actor they're interested in whenever they wanted.

From another point of view, general multimedia content analysis remains an unsolved problem, even with the significant progress made in the past 20 years. Unlike general content analysis, however, face-related technologies such as face detection, tracking and recognition have recently matured into consumer products. The combination of these advances with our relentless pursuit of a better user experience at Hulu is where the idea of "Face Match" originated.

System design

When first examining the problem, one solution would be to examine all frames of the video and exhaustively annotate, by hand, all the faces that appear in them. However, this method would not scale to the billions of videos on the Internet. The other extreme would be to let an algorithm automatically detect and identify the faces in every frame. The bottleneck of this approach is that current recognition algorithms achieve only about 80% accuracy at best, which is far below the minimal user expectation. Taking both methods into account, it became apparent that the best solution would combine the merits of each while minimizing the human effort required.

Our system was designed to carefully balance the computational complexity while also minimizing human effort. As shown in Figure 1, the Face Match platform contains two main parts: the initial segment and the auto-tag-cluster-confirm cycle. For each video, faces are detected and grouped into face tracks/groups. The details of these technologies are described in the next paragraphs.

To minimize the amount of human effort required to label each individual face, visually similar face tracks are grouped via clustering. Thus, a human can select a number of face tracks at a given time and label all of them in one fell swoop.

For each show, the system first collects celebrity information from the web. Then, for initial videos in each show, 20 percent of face tracks are clustered and left for manual labeling. These bootstrapped celebrity labels are helpful in supervised track tagging. Though all face tracks can be clustered and simply left for manual labeling, this leads to a heavy workload. To improve the efficiency of human annotation, we’ve introduced an auto-tag-cluster-confirm cycle. With the bootstrap labels, the system can learn predictive models for celebrities. The models predict unlabeled tracks that are left for human confirmation. As the pool of celebrity labels grows with each iterative cycle, the system is able to learn face models with better precision. In the front end, displaying a large number of a celebrity’s face tracks for manual confirmation would be inefficient since a human still needs seconds to verify each face track. Similar to the initial annotation process, the system also clusters visually similar face tracks together. Thus, humans can confirm a number of tracks in one simple click, with one quick glance.

Figure 1. Overview of the system design. A.) Face groups/tracks are detected and extracted for each video; B.) For each show, celebrity information is collected and the initial pool of face tracks (20 percent) is clustered for bootstrap labels by user annotation; C.) Automatic face track tagging is introduced in the auto-tag-cluster-confirm cycle to minimize the human effort.

To detect and connect faces and place them into tracks, we leverage face detection and tracking algorithms. We've trained a multi-view face detector covering 180 degrees of in-plane rotation and 180 degrees of yaw change with about ten thousand labeled examples. The face detection algorithm needs roughly 300 milliseconds to process a 640×360 frame (running on a PC with a Xeon(R) E5420 CPU at 2.50 GHz), so detecting on every video frame would take about nine times longer than real-time playback — which is unacceptably slow. Our tracking system rescues us from such heavy computation and associates isolated faces into continuous tracks. It can also extend face tracks to frames well beyond the initial detections, which increases the whole system's recall at moments when the detector misses existing faces, and it effectively reduces the number of candidates for later face tagging by a factor of 100. As a result, we only need to tag face tracks rather than isolated faces. To avoid the "drifting away" phenomenon in tracking, shot boundaries are detected and incorporated as well.

In automatic track tagging, we also take advantage of the multi-sample (multiple faces per face track), multi-view features (clothes, hair, facial and contextual features). As in Figure 2, where the pipeline of automatic track tagging is shown, the system first builds databases with annotated tracks. Then for each face track, the system extracts multi-view features for all samples in the track. And, for each face and each feature, the system finds its nearest neighbors via ANN and decomposes it as a linear combination of its neighbors. Finally, the identity coefficient distributions for all faces and all features are aggregated for the final tagging of this track. For the details of the algorithm, please refer to the Track Tagging section.

Figure 2. Algorithm pipeline for the track tagging method

Processing pipeline

As shown in Figure 1a), when a video is about to be released, the system first determines the shot boundaries, densely samples frames, and applies the multi-view face detector to each sampled frame. The detected faces provide starting points for face tracking; tracking algorithms then associate the isolated faces into connected face groups of the same person. With the extracted tracks, clustering algorithms group similar tracks for user annotation, or a track tagging stage automatically tags actor candidates on each track for user confirmation. Finally, the face tracks are left for human annotation or confirmation.

Combining these steps, we can automatically tag the tracks of a video in real time. For a typical TV show, 80 percent of the face tracks can be picked out successfully, with 5 percent false positives. After that, we still require human intervention to verify the final results.

In the next three blogs, four core techniques — face detection, face tracking with shot boundary detection, face track clustering, and face track recognition — will be introduced. The annotation step is omitted since this blog covers only technical algorithms.
