Hulu Tech Blog

The Search for the Perfect Stream: Hulu’s New Quality of Service Portal

February 21st, 2012 by Josh Siegel

At Hulu, quality is at the heart of everything we do. It’s right at the top of our team credo: “What defines Hulu”. Last year we formed a new Quality of Service team, dedicated to ensuring our video playback always looks great across every device on every Internet connection. This is especially challenging when you consider how many potential points of failure exist in our video publishing pipeline and request chain:

  • File integrity: Files delivered from our content providers are transcoded and uploaded to our Content Delivery Networks (CDNs). During this process, they are moved multiple times, risking corruption and forcing us to balance the overhead of running MD5 hash checks against the throughput of the pipeline (a sketch of such a check follows this list).
  • CDN cache: Once on the CDN, cache misses force the stream to be served from the origin, introducing intra-CDN network latency and slowing end-user delivery.
  • Network routing: Depending on the transit and peering relationships established by the CDN, the video may be routed inefficiently over the Internet to the end user.
  • ISP congestion: Once the video reaches the ISP network, it may encounter congestion in the last mile, causing added latency and dropped packets.
  • Video playback: Once the client receives and buffers the video, the local CPU is taxed both by the decryption and decoding overhead necessary to render the video in the player, often resulting in dropped frames.
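
As a minimal sketch of the integrity check mentioned in the first item above (the chunked reading, file path, and helper names are illustrative assumptions, not our pipeline code):

import hashlib

def md5sum(path, chunk_size=8 * 1024 * 1024):
    """Compute a file's MD5 in fixed-size chunks to keep memory use flat."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical check after a copy between pipeline stages: compare against
# the digest recorded when the file first entered the pipeline.
def verify_copy(path, expected_md5):
    if md5sum(path) != expected_md5:
        raise IOError("%s was corrupted in transit" % path)

Each such pass over a multi-gigabyte video file costs real I/O, which is exactly the throughput trade-off noted above.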

Add to this the complexity of serving over a variety of protocols: RTMPE, HLS, HTTP progressive, and others. Each requires specific configuration, such as optimizing our Flash Media Server (FMS) API interactions or ensuring a device player is tolerant enough to handle a corrupted HLS video chunk. Any of the problems listed above can lead to user-visible playback errors. Our first goal was to create a set of dashboards that give us insight into the errors our users are experiencing, along with detailed diagnostic information to address them. To this end, we recently launched a new internal Quality of Service Portal, shown below. The portal tracks key quality metrics that impact the streaming experience for users, breaking them down by platform, region, and timeframe so we can follow them over time. These include:

  • Rebuffering rates: The percent of sessions that stall because the player’s buffer has under-run, leaving the user to watch a spinning circle icon instead of their video.
  • Incorrect ends: The percent of sessions in which the video stops unexpectedly, requiring a user to re-start the stream in order to continue watching their video.
  • Skip sessions: The percent of sessions in which the video skips ahead a few seconds unexpectedly.
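
A session-level metric like the first one above reduces to simple counting. As a toy illustration (the rebuffered field name is an assumption, not our actual beacon schema):

def rebuffering_rate(sessions):
    """Percent of sessions that stalled at least once."""
    if not sessions:
        return 0.0
    stalled = sum(1 for s in sessions if s["rebuffered"])
    return 100.0 * stalled / len(sessions)

sessions = [{"rebuffered": True}, {"rebuffered": False}, {"rebuffered": False}]
print("%.1f%%" % rebuffering_rate(sessions))  # prints "33.3%"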

We recognized early on that if we wanted to ensure a great streaming video experience, we had to deepen our relationships with our CDN partners. The portal enables these partners to log in externally and pull the graphs themselves. But insight alone is rarely sufficient to address problems that arise. We also introduced functionality to make the data actionable. When our CDN partners discover an anomaly in performance, they navigate to the “Logs” tab and pull records with detailed session data. This helps them isolate issues in specific hosts and improve network performance.

Architecturally, the primary challenge in building this system was scaling our data-ingest capacity and backend aggregation capabilities. The players built into each of our device endpoints (Flash in the browser, iOS, PS3, Xbox, etc.) periodically send “beacons” to our QoS front ends, reporting a wide variety of statistics in a common JSON format. Because most of our usage comes from the US, there’s a large delta between our baseline and peak usage throughout the day. To accommodate changing traffic patterns, we use virtual machines to create on-demand ingress capacity, scaling our first tier up dynamically as needed. The first tier parses the beacons, validates their structure and content, and filters out bad records. It then forwards the data to our second tier, made up of MongoDB instances. Each beacon has a GUID, which is modded to distribute load evenly across this persistence layer. Once the data is written to memory, the second tier runs a periodic batch process using Mongo’s native mapreduce framework to aggregate the session data across a number of keys, ultimately enabling low-latency queries from the portal UI. The aggregated data is then pushed to a third MongoDB tier that persists it for serving to the portal. See the accompanying diagram for a high-level visual representation.
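
As a rough sketch of that first-tier routing (the field names, shard list, and validation rules are illustrative assumptions, not our production schema):

import json
import uuid

SHARDS = ["mongo-qos-0", "mongo-qos-1", "mongo-qos-2", "mongo-qos-3"]

def route_beacon(raw):
    """Validate one JSON beacon and pick a second-tier shard by GUID mod."""
    try:
        beacon = json.loads(raw)
    except ValueError:
        return None  # malformed JSON: filter the record out
    if "guid" not in beacon or "platform" not in beacon:
        return None  # structurally invalid: filter the record out
    shard = uuid.UUID(beacon["guid"]).int % len(SHARDS)
    return SHARDS[shard], beacon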

Now that the initial version of the portal is in production, our next step is to enable greater insight at a finer level of granularity. We plan to add new metrics such as detailed player load latency, average bitrate, and CDN failover. We believe that with a continued focus on quality of service, we’ll see significant improvement in our video delivery in 2012.

Joshua Siegel is Senior Program Manager for Hulu’s Content Platform and Video Distribution, guiding development of the technology that gets content from our Media Partners to our users.

Greg Femec is lead engineer on our Quality of Service initiatives in addition to launching Hulu Latino and working on third party authentication services.

Simulating HTTP Live Streaming (HLS) – A way to ensure video playback works great

July 25th, 2011 by Ludo Antonov

Hulu Plus launched out of beta in November of last year, and it’s currently available on a number of mobile and living-room connected devices. One of the core delivery protocols Hulu relies on for video streaming is HTTP Live Streaming (HLS). Documented by Apple in an IETF Internet-Draft, HLS is now widely used and already available as a video delivery vehicle on some of the major devices on the market, like the iPhone/iPad, Sony PlayStation®3, Roku, Android 3.0, and more.

In a nutshell, the HLS protocol delivers video over HTTP via a playlist of small segments that are made available in a variety of bitrates from one or more delivery servers. This allows the playback engine to switch on a segment-by-segment basis between different bitrates and content delivery networks (CDN). It helps compensate for some of the network variances and infrastructure failures that might occur during playback.
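
For readers unfamiliar with the format, a master playlist might look like the following (the hosts, paths, and bitrates here are illustrative):

#EXTM3U
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=650000
http://cdn1.example.com/content/650k.m3u8
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=1500000
http://cdn1.example.com/content/1500k.m3u8

Each variant playlist in turn lists the individual segment URLs, and the player re-evaluates its bitrate and server choice as it fetches each segment.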

But what determines a quality viewing experience and uninterrupted video playback? That largely depends on the client’s playback engine. Different device platforms usually have different playback engines, each with its own implementation of the HLS protocol. These implementations often differ not only in how completely they cover the HLS spec, but also in their streaming and bitrate-switching heuristics, such as how they choose to act under different network conditions. The differences are especially noticeable with sudden changes in the stability of the network, when playing content on low-bandwidth networks, and with partial failures in the video delivery infrastructure.

Hulu reaches millions of consumers every month, so we’re exposed to a wide spectrum of network conditions, and hitting edge playback scenarios is not uncommon. Because we provide a client-facing application, maintaining the best viewing experience possible, even in suboptimal conditions, ultimately falls to the Hulu Plus app. When a user reports a problem with playback, we need to be able to simulate the user’s environment (network conditions, device, content played) so we can determine the root cause of the problem and find out whether there is a viable solution for it. Sometimes this results in discovering issues with the playback engines; unless we can reproduce these sorts of problems, it is virtually impossible to hand them off to a device manufacturer to fix. Other times, the right solution is to make the UI more forgiving or to make the app smarter about recovering from unexpected failures. Either way, the ability to reproduce the troubled scenarios is key to taking appropriate action.

At Hulu, to address the need to reproduce these kinds of scenarios, we recently started working on an infrastructure service called DripLS (short for Drip LiveStreaming). The purpose of the service is to traffic-shape a video stream in accordance with a set of rules. It’s an attempt to simulate real-world network conditions to help ensure that clients and streaming engines degrade gracefully and deliver the best viewing experience possible. DripLS acts as an intermediary between the server hosting HLS video segments and the HLS client, caching segments that need to be traffic-shaped and rewriting the m3u8 playlists that the HLS clients receive. The basic flow of the service is outlined in Figure 1.

Figure 1. DripLS workflow

For example, DripLS allows us to simulate a sudden network drop that will cause the video playback to “stall.” It can simulate missing segments that will cause a playback “skip,” too, or simulate a mid-stream CDN failure, thus exercising CDN fallback scenarios. It’s also capable of serving video files as they would be transmitted on a low-bandwidth or “lossy” network. DripLS has almost countless useful applications for validating video playback, and these are just a few of the ones we’ve been able to capture and experiment with since we built the service. The results have already helped us in making streaming more reliable and resilient to failures and making our client-side monitoring infrastructure more aware of these problems when they occur in production.

How does it work?

DripLS appears as a normal HLS endpoint that can be used directly by any HLS client, which means any device that supports HTTP Live Streaming can use the service without additional provisioning. To achieve the desired traffic shaping, a set of rules controlling how the incoming stream will be shaped can be passed to DripLS via the URL’s query string. The DripLS URL is in the following format:

http://<dripls-host>/master.m3u8?authkey=<authkey>&cid=<cid>&[r=<rule-expression>~<action>,…]

An actual DripLS URL might look like this:

http://<dripls-host>/master.m3u8?authkey=<authkey>&cid=<cid>&r=650k~e404,1500k.*~e500,cdn1.*.s2~net10loss1

In the example above, for the stream identified by cid (content id), DripLS is instructed to return HTTP error code 404 for the variant playlist encoded at a 650 kbit/s bitrate, and HTTP error code 500 for all video segment files in the 1500 kbit/s bitrate playlist. Additionally, segment 2 from CDN 1 in all variant bitrate playlists will be transmitted back at 10 kbit/s with 1% packet loss.

DripLS supports two rule classes: e<> and net<>. Matches from the e<> class result in direct rewrites of URLs in the HLS m3u8 playlists to specific URLs that raise the specified HTTP error code. Matches from the net<> class are a little more involved: the matched segments are cached and then transmitted under the rule-specified network conditions.
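
As a minimal sketch of the matching step (the segment-id convention "<cdn>.<bitrate>.<segment>" and the helper below are assumptions for illustration, not DripLS internals):

import re

def match_rule(segment_id, rules):
    """Return the action of the first rule whose pattern matches the id."""
    for pattern, action in rules:
        if re.match(pattern + "$", segment_id):
            return action
    return None  # unmatched playlists and segments pass through untouched

# The rules from the sample URL above, as (pattern, action) pairs.
rules = [("650k", "e404"), ("1500k.*", "e500"), ("cdn1.*.s2", "net10loss1")]
print(match_rule("cdn1.650k.s2", rules))   # -> net10loss1
print(match_rule("cdn2.650k.s1", rules))   # -> None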

Technology

DripLS uses a combination of technologies to achieve the desired traffic-shaping effect. Under the hood, the current setup consists of two nginx sites on different ports, one proxying to the other, which in turn forwards to a cherrypy server that handles the business logic for DripLS (all on a single machine). A segment request always comes in through the first nginx site, which listens on port 80; it is proxied to the second nginx site on an arbitrary port that has already been shaped for the segment, and ultimately forwarded to the cherrypy instance. This setup is needed because, to attain the desired traffic shaping, DripLS makes use of tc (traffic control), netem (a Linux kernel module), and iptables (network rule chaining), for which the smallest level of granularity is a port.

The basic architecture of the service is shown in Figure 2.

Figure 2: DripLS architecture

Every time an HLS segment is to be traffic-shaped, the shaping is done on a port reserved exclusively for that segment’s transmission to the client. The port is shaped via a small custom shell script (see set_ts_lo.sh below), in accordance with the traffic-shape rule that the segment matched. The URL for the segment is then rewritten so that the front nginx site can proxy_pass the request to the second nginx site, which accepts it on the already-shaped port. When the transmission of the segment’s data starts, netem and iptables ensure that it adheres to the network rules already applied to the port.
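
As an illustrative sketch of that hand-off (the port pool and helper name are assumptions; only the call into the shaping script shown below reflects the actual mechanism), the business-logic layer might do something like:

import re
import subprocess

PORT_POOL = iter(range(20000, 21000))  # hypothetical pool of shapeable ports

def reserve_shaped_port(action):
    """Map e.g. 'net10loss1' to (10 kbps, 1% loss) and shape a fresh port."""
    kbps, loss = re.match(r"net(\d+)loss(\d+)$", action).groups()
    port = next(PORT_POOL)
    subprocess.check_call(["sudo", "./set_ts_lo.sh", str(port), kbps, loss])
    return port  # the segment URL is then rewritten to target this port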

#!/bin/bash
# set_ts_lo.sh – shape a single local port with a bandwidth cap and packet
# loss, using tc (htb), netem, and iptables packet marking.

E_BADARGS=65

if [ $# -ne 3 ]
then
  echo "Usage: `basename $0` <port> <speed-limit-in-kbps> <packet-loss-percent>"
  exit $E_BADARGS
fi

# tc class ids are hexadecimal, so convert the port number for use as a classid
hexport=$(echo "obase=16; $1" | bc)
# Derive a unique netem handle for this port by appending "2" to the port number
netem_loss_handle="${1}2"

# Add main classes (these fail harmlessly if they already exist)
/sbin/tc qdisc add dev lo root handle 1: htb
/sbin/tc class add dev lo parent 1: classid 1:1 htb rate 1000000kbps

echo "------- Remove any previous rule"
# Delete any old rules for this port (if none exist yet, failures here are expected)
/sbin/tc qdisc del dev lo parent 1:$hexport handle $netem_loss_handle
/sbin/tc filter del dev lo parent 1:0 prio $1 protocol ip handle $1 fw flowid 1:$hexport
/sbin/tc class del dev lo parent 1:1 classid 1:$hexport
/sbin/iptables -D OUTPUT -t mangle -p tcp --sport $1 -j MARK --set-mark $1

echo "------- Adding rule"
# Cap the class at the requested rate, route packets marked with the port
# number into it, attach netem packet loss, and mark the port's outgoing traffic
/sbin/tc class add dev lo parent 1:1 classid 1:$hexport htb rate $2kbps ceil $2kbps prio $1
/sbin/tc filter add dev lo parent 1:0 prio $1 protocol ip handle $1 fw flowid 1:$hexport
/sbin/tc qdisc add dev lo parent 1:$hexport handle $netem_loss_handle: netem loss $3%
/sbin/iptables -A OUTPUT -t mangle -p tcp --sport $1 -j MARK --set-mark $1

set_ts_lo.sh – Script to simplify interaction with netem, tc, iptables
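
For example, shaping port 20001 (an arbitrary port from the pool) to 10 kbps with 1% packet loss, as the net10loss1 rule earlier would require, comes down to running: sudo ./set_ts_lo.sh 20001 10 1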

Although DripLS can be used as a remote cloud service, running the service and the device on the same network helps avoid “last mile” deviations on top of the intended alterations between the service and the device. That said, running DripLS on a remote network has yielded consistent results for us so far. Simulations via DripLS are an alternative to hardware-based testing, a common way to validate behavior under network alterations, and the DripLS approach has several key advantages: it allows multiple developers to use the service at once; it allows precise and repeatable simulations; it makes it easier to test a variety of scenarios; it requires little-to-no setup; it can test on any network; and, last but not least, it makes it easy to share pre-shaped streams with partners.

Future

Currently, we’re using DripLS mainly for manual testing and ad-hoc reproduction of interesting playback scenarios. When we receive major device firmware upgrades, we use the service to re-verify the basics and the more common edge-case scenarios. We also want to expand DripLS with support for additional delivery protocols in the future. We’re in the process of deeply integrating our device tests with DripLS, and of building a more standardized set of acid tests that we can execute across a variety of devices. These efforts will help us establish confidence that the playback engine on a device, and the Hulu app running on top of it, can cope with a variety of network conditions and playback scenarios.

Use it, and make it better

DripLS has been so useful for us that we decided to share it with the world as an open-source tool. You can find DripLS on GitHub at https://github.com/hulu/DripLS – please feel free to fork, comment, improve, fix, and repurpose it as you see fit. We also welcome your comments in the discussion group at http://groups.google.com/group/dripls-dev – please let us know if you use DripLS, how you like it, and what changes you’d like to see.

Ludo Antonov, a software engineer, is building things that make your brain go mushi-mush.
* Header image by Branden Williams, licensed CC-BY 2.0 (see http://www.flickr.com/photos/captbrando/3336992646)

 
