At Hulu, quality is at the heart of everything we do. It's right at the top of our team credo: "What defines Hulu". Last year we formed a new Quality of Service team, dedicated to ensuring our video playback always looks great across every device on every Internet connection. This is especially challenging when you consider how many potential points of failure exist in our video publishing pipeline and request chain:
- File integrity: Files delivered from our content providers are transcoded and uploaded to our Content Delivery Networks (CDNs). During this process, they are moved multiple times, risking corruption and forcing us to balance the overhead of running MD5 hash checks against the throughput of the pipeline.
- CDN cache: Once on the CDN, cache misses force the stream to be served from the origin, introducing intra-CDN network latency and slowing end-user delivery.
- Network routing: Depending on the transit and peering relationships established by the CDN, the video may be routed inefficiently over the Internet to the end user.
- ISP congestion: Once the video reaches the ISP network, it may encounter congestion in the last mile, causing added latency and dropped packets.
- Video playback: Once the client receives and buffers the video, the local CPU is taxed both by the decryption and decoding overhead necessary to render the video in the player, often resulting in dropped frames.
Oh, and you have to add to this the complexity of serving over a variety of protocols: RTMPE, HLS, HTTP progressive, etc. Each requires specific configuration, such as optimizing our Flash Media Server (FMS) API interactions or ensuring a device player is tolerant enough to handle a corrupted HLS video chunk. Any of the problems listed above can lead to user-visible playback errors. Our first goal was to create a set of dashboards that give us insight into the errors our users are experiencing, along with the detailed diagnostic information needed to address them.
To this end, we recently launched a new internal Quality of Service Portal, shown below. The portal tracks key quality metrics that impact the streaming experience for users, breaking them down by platform, region, and timeframe, and enabling us to track them over time. These include:
- Rebuffering rates: The percentage of sessions that stall because the player's buffer has under-run, leaving the user to watch a spinning circle icon instead of their video.
- Incorrect ends: The percentage of sessions in which the video stops unexpectedly, requiring the user to restart the stream in order to continue watching their video.
- Skip sessions: The percentage of sessions in which the video unexpectedly skips ahead a few seconds.
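As a rough illustration of how metrics like these can be computed, here is a minimal Python sketch. The `Session` record and its field names are hypothetical and chosen only for this example; they are not the portal's actual schema. Each rate is the share of sessions that hit the problem, and the same calculation is repeated per platform to mimic the portal's breakdowns.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Session:
    # Hypothetical per-session summary; field names are assumptions.
    platform: str
    rebuffered: bool
    ended_incorrectly: bool
    skipped: bool


def quality_rates(sessions):
    """Return each metric as a percentage of the given sessions."""
    total = len(sessions)
    if total == 0:
        return {"rebuffering": 0.0, "incorrect_ends": 0.0, "skips": 0.0}

    def pct(count):
        return 100.0 * count / total

    return {
        "rebuffering": pct(sum(s.rebuffered for s in sessions)),
        "incorrect_ends": pct(sum(s.ended_incorrectly for s in sessions)),
        "skips": pct(sum(s.skipped for s in sessions)),
    }


def rates_by_platform(sessions):
    """Group sessions by platform and compute the same rates per group."""
    groups = defaultdict(list)
    for s in sessions:
        groups[s.platform].append(s)
    return {platform: quality_rates(group) for platform, group in groups.items()}


# Made-up sample data, for illustration only.
sessions = [
    Session("ios", True, False, False),
    Session("ios", False, False, False),
    Session("ps3", False, True, False),
    Session("xbox", False, False, True),
]
overall = quality_rates(sessions)
per_platform = rates_by_platform(sessions)
print(overall)
print(per_platform["ios"])
```

The same grouping idea extends to region or time bucket by changing the grouping key.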
We recognized early on that if we wanted to ensure a great streaming video experience, we had to deepen our relationships with our CDN partners. The portal enables these partners to log in externally and pull the graphs themselves. But insight alone is rarely sufficient to address problems that arise. We also introduced functionality to make the data actionable. When our CDN partners discover an anomaly in performance, they navigate to the "Logs" tab and pull records with detailed session data. This helps them isolate issues in specific hosts and improve network performance.
Architecturally, the primary challenge in building this system was scaling our data ingest capacity and backend data aggregation capabilities. The players built into each of our device endpoints (Flash in the browser, iOS, PS3, Xbox, etc.) periodically send "beacons" to our QoS front ends, reporting a wide variety of statistics in a common JSON format. Because most of our usage comes from the US, there's a large delta between our baseline and peak usage throughout the day. To accommodate changing traffic patterns, we use virtual machines to create on-demand ingress capacity, scaling up our first tier dynamically as needed.
The first tier parses the beacons, validates structure and content, and filters out bad records. It then forwards the data to our second tier, composed of MongoDB instances. Each beacon has a GUID which is modded to distribute load evenly across this persistence layer. Once the data is written to memory, the second tier runs a periodic batch process using Mongo's native map-reduce framework to aggregate the session data across a number of keys that ultimately enable low-latency queries from the portal UI. The aggregated data is then pushed to a third MongoDB tier that persists it for serving to the portal. See this diagram for a high-level visual representation:
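Separately from the diagram, the validate, route, and aggregate steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the beacon field names (`guid`, `platform`, `event`), the shard count, and the helper functions are all hypothetical, and a plain in-process `Counter` stands in for the MongoDB map-reduce job rather than reproducing it.

```python
import json
import uuid
from collections import Counter

NUM_SHARDS = 4  # assumed number of second-tier instances
REQUIRED_FIELDS = {"guid", "platform", "event"}  # hypothetical beacon schema


def parse_beacon(raw):
    """Return the beacon as a dict, or None if it is malformed."""
    try:
        beacon = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(beacon, dict) or not REQUIRED_FIELDS.issubset(beacon):
        return None
    try:
        uuid.UUID(beacon["guid"])  # reject beacons whose GUID does not parse
    except (ValueError, TypeError, AttributeError):
        return None
    return beacon


def shard_for(guid, num_shards=NUM_SHARDS):
    """Pick a shard by taking the GUID's integer value modulo the shard count."""
    return uuid.UUID(guid).int % num_shards


def aggregate(beacons):
    """Count events per (platform, event) key; a stand-in for map-reduce."""
    return Counter((b["platform"], b["event"]) for b in beacons)


# Made-up sample input: two valid beacons, one bad GUID, one non-JSON record.
raw_beacons = [
    '{"guid": "00000000-0000-0000-0000-000000000005", "platform": "ios", "event": "rebuffer"}',
    '{"guid": "00000000-0000-0000-0000-00000000000a", "platform": "ios", "event": "rebuffer"}',
    '{"guid": "not-a-guid", "platform": "ps3", "event": "skip"}',
    "not json",
]
valid = [b for b in map(parse_beacon, raw_beacons) if b is not None]
shards = [shard_for(b["guid"]) for b in valid]
totals = aggregate(valid)
print(shards, dict(totals))
```

Because the shard is derived only from the GUID, every beacon carrying the same GUID lands on the same instance, while distinct GUIDs spread roughly evenly across instances.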
Now that the initial version of the portal is in production, our next step is to enable greater insight at a finer level of granularity. We plan to add new metrics such as detailed player load latency, average bitrate, and CDN failover. We believe that, with a continued focus on quality of service, we'll see significant improvement in our video delivery in 2012.
Joshua Siegel is Senior Program Manager for Hulu's Content Platform and Video Distribution, guiding development of the technology that gets content from our Media Partners to our users.
Greg Femec is lead engineer on our Quality of Service initiatives, in addition to launching Hulu Latino and working on third-party authentication services.