As the Internet grows more and more popular, information overload poses an important challenge for many online services. With so much information pouring out of the web, users can be overwhelmed and confused about what, exactly, they should be paying attention to.
A recommendation system provides a solution when a lot of useful content becomes too much of a good thing. A recommendation engine can help users discover information of interest by analyzing historical behaviors. More and more online companies — including Netflix, Google, Facebook, and many others — are integrating a recommendation system into their services to help users discover and select information that may be of particular interest to them.
With literally tens of thousands of hours of premium video content, Hulu users are also prone to content overload. Given the wide variety of content available on the service at any one time, it can be difficult for Hulu users to discover new videos that best match their interests. So the first goal of Hulu’s recommendation system is to help users find content that will be of interest to them.
In addition to users, Hulu’s recommendation system should also help content owners promote their video. Part of our mission is to deliver a service that users, advertisers, and content owners all unabashedly love. We have many different content partners, and we understand that these partners want more Hulu users to watch their videos — especially when new videos are released. By using personalized recommendations instead of more traditional promotion, we can promote video content more effectively, since we promote directly to users who are likely to enjoy the content we are recommending.
Before explaining the design of our recommendation system, we want to describe some characteristics of our data.
Since a lot of our content is comprised of episodes or clips within a show, we have decided to recommend shows to users instead of individual videos. Shows are a good method of organization, and videos in the same show are usually very closely related.
Our content can be mainly divided into two parts: on-air shows and library shows. On-air shows are highly important since more than half of our streaming comes from them.
Although on-air shows make up a large part of our content, they are subject to a seasonal effect. During the summer months, most on-air shows do not air, causing on-air streaming to decrease. Furthermore, fewer shows air during weekends, so streaming of library shows increases then. Keeping this in mind, we can design the recommendation system to recommend more library shows to users during weekends or the summer months, for example.
The key data that drives most recommendation systems is user behavior data, of which there are two main types: implicit and explicit feedback. Explicit feedback data primarily includes user voting data. Implicit feedback data includes information on watching, browsing, searching, and so on. Explicit feedback data shows a user’s preference for a show directly, but implicit feedback data does not. For example, if a user gives a 5-star rating to a show, we know that this user likes the show very much. But if a user only watches a video from a show page or searches for a show, we don’t know whether this user actually likes the show.
As the quantity of implicit data at Hulu far outweighs the amount of explicit feedback, our system should be designed primarily to work with implicit feedback data.
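Since the system is designed around implicit feedback, each behavior type has to be turned into a numeric preference weight before it can drive recommendations. The sketch below shows one simple way to do this; the event types follow the text, but the specific weights and the `build_preferences` helper are illustrative assumptions, not Hulu's actual values.

```python
from collections import defaultdict

# Sketch: turning implicit feedback events into preference weights r(u, j).
# The numeric weights here are illustrative assumptions.
EVENT_WEIGHTS = {"watch": 1.0, "subscribe": 0.8, "browse": 0.3, "search": 0.2}

def build_preferences(events):
    """events: iterable of (user, show, event_type) tuples.
    Keeps the strongest signal seen per (user, show) pair."""
    prefs = defaultdict(dict)
    for user, show, etype in events:
        w = EVENT_WEIGHTS.get(etype, 0.0)
        if w > prefs[user].get(show, 0.0):
            prefs[user][show] = w
    return prefs

events = [
    ("u1", "Family Guy", "watch"),
    ("u1", "American Dad", "search"),
    ("u2", "Family Guy", "browse"),
    ("u2", "Family Guy", "watch"),   # the stronger watch signal wins
]
prefs = build_preferences(events)
print(prefs["u2"]["Family Guy"])  # 1.0
```

Keeping only the strongest signal per show is one design choice; summing signals is another reasonable option.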
There are many different types of recommendation algorithms, and perhaps the most famous is collaborative filtering (CF). CF relies on user behavior data, and its main idea is to predict user preferences by analyzing their behaviors. There are two types of CF methods: user-based CF (UserCF) and item-based CF (ItemCF). UserCF assumes that a user will prefer items liked by other users with similar preferences. ItemCF assumes that a user will prefer items similar to the items he or she preferred previously. ItemCF is widely used by many others (for example, Amazon and Netflix), as it has two main advantages. First, it is suitable for sites with many more users than items. Second, ItemCF can easily explain recommendations in terms of users’ historical behaviors. For example, if you have watched “Family Guy” on Hulu, we will recommend “American Dad” to you and tell you that we recommend it because you have watched “Family Guy”. So we use ItemCF as our basic recommendation algorithm at Hulu.
Figure 1 shows the online architecture of our recommendation system, which contains five main modules.
In the above on-line architecture, some components rely on offline resources, such as the topic model, related model, feedback model, etc. The off-line system is also an important part of our recommendation system. Our off-line system has these main components:
- Data Center: The data center contains all user behavior data at Hulu. Some of this data is stored in Hadoop clusters and some in a relational database.
- Related Table Generator: The related table is an important resource for online recommendation. We use two main types of related tables: one based on collaborative filtering (which we’ll call CF) and another based on content. In the CF table, show A and show B have high similarity if users who like show A also like show B. For the content-based table, we use content information including title, description, channel, company, actor/actress, and tags.
- Topic Model: A topic is represented by a group of shows with similar content. Topics are thus larger in scope than shows, but still smaller than channels. Our topics are learned by LDA (Latent Dirichlet Allocation), a popular topic model in machine learning.
- Feedback Analyzer: Feedback here specifically means users’ reactions to recommendation results, and using it can improve recommendation quality. For example, if a show is recommended to many users but most of them do not click it, we’ll decrease that show’s rank. Users also differ in which behaviors best predict their taste: some users’ best recommendations come from their prior watch history, while others’ come from their voting behavior. All of these effects can be modeled offline by analyzing users’ feedback on their recommendations.
- Report Generator: Evaluation is the most important part of the recommendation system. The report generator produces a daily report with multiple metrics showing the quality of recommendations. At Hulu we monitor metrics including CTR, conversion ratio, and others.
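The content-based related table can be sketched as a pairwise score over show metadata. The example below uses Jaccard similarity over channel, actors, and tags; the metadata fields and the show records are invented for illustration and are not real Hulu data.

```python
# Sketch of a content-based related table: similarity as Jaccard overlap
# of metadata terms (channel, actors, tags). All data here is illustrative.
def content_features(show):
    feats = set(show.get("tags", []))
    feats.update(show.get("actors", []))
    feats.add(show.get("channel", ""))
    return feats

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

shows = {
    "Family Guy": {"channel": "Animation", "tags": ["comedy", "animated"],
                   "actors": ["Seth MacFarlane"]},
    "American Dad": {"channel": "Animation", "tags": ["comedy", "animated"],
                     "actors": ["Seth MacFarlane"]},
    "News Tonight": {"channel": "News", "tags": ["current events"],
                     "actors": []},
}
feats = {name: content_features(meta) for name, meta in shows.items()}
print(jaccard(feats["Family Guy"], feats["American Dad"]))  # 1.0
print(jaccard(feats["Family Guy"], feats["News Tonight"]))  # 0.0
```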
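The daily report can be sketched as a simple aggregation over an event log. The log format and source labels below are assumptions for illustration, as is the definition of conversion as watches over clicks; CTR is clicks over impressions.

```python
from collections import Counter

# Sketch of a daily report: CTR and conversion ratio per recommendation
# source. Log records and field names are illustrative assumptions.
def daily_report(log):
    """log: iterable of (source, event) where event is 'impression',
    'click', or 'watch'. Returns {source: {'ctr': ..., 'conversion': ...}}."""
    counts = Counter(log)
    report = {}
    for source in {s for s, _ in log}:
        imp = counts[(source, "impression")]
        clk = counts[(source, "click")]
        wat = counts[(source, "watch")]
        report[source] = {
            "ctr": clk / imp if imp else 0.0,
            "conversion": wat / clk if clk else 0.0,
        }
    return report

log = [("watch_history", "impression")] * 100 + \
      [("watch_history", "click")] * 10 + \
      [("watch_history", "watch")] * 5
print(daily_report(log)["watch_history"])  # {'ctr': 0.1, 'conversion': 0.5}
```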
So far, we’ve given a brief overview of our recommendation architecture. From the previous discussion, we can see that Hulu’s recommendation system is primarily based on ItemCF. We’ve added many improvements on top of the ItemCF algorithm, too, in order to make it generate better recommendations. To test these improvements, we’ve performed many A/B tests on different algorithms. In the following sections, we’ll introduce some of these algorithms and the experiment results.
Item-based Collaborative Filtering
Item-based Collaborative Filtering (ItemCF) is the basis of all our algorithms. In ItemCF, let N(u) be the set of items user u has preferred previously. User u’s preference for item i (with i not in N(u)) can then be measured by:

p(u,i) = Σ_{j ∈ N(u)} r(u,j) · s(i,j)
Here, r(u,j) is the preference weight of user u on show j, and s(i,j) is the similarity between show i and show j. In CF, the similarity between two shows is calculated from user behavior data on those shows. Let N(i) be the set of users who watched show i and N(j) the set of users who watched show j. Then the similarity s(i,j) between show i and show j is calculated by the following formula:

s(i,j) = |N(i) ∩ N(j)| / |N(i)|
In this definition, show i will be highly relevant to show j if most users who watch show i also watch show j. However, this definition has the “Harry Potter problem”: every show ends up with high relevance to popular shows, simply because almost everyone watches them.
The first lesson we learned from A/B testing is that recommendations should fit users’ recent preferences, and that users’ recent behavior is more important than their older, historical behavior. So, in our engine, we put more weight on users’ recent behaviors. In our system, the CTR of recommendations that originate from users’ recent watch behavior is 1.8 times higher than the CTR of recommendations originating from older watch behavior.
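One simple way to put more weight on recent behaviors is an exponential time decay on the preference weight r(u,j). The sketch below is an assumption about how such a decay could look; the 30-day half-life is illustrative, not Hulu's actual parameter.

```python
import math

# Sketch of recency weighting: exponentially decay the preference weight
# r(u, j) with the age of the behavior, so recent watches dominate.
HALF_LIFE_DAYS = 30.0  # illustrative assumption

def decayed_weight(base_weight, age_days):
    return base_weight * math.pow(0.5, age_days / HALF_LIFE_DAYS)

print(decayed_weight(1.0, 0))             # 1.0   (watched today)
print(decayed_weight(1.0, 30))            # 0.5   (a month ago counts half)
print(round(decayed_weight(1.0, 90), 3))  # 0.125 (old behavior fades)
```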
Just because a recommendation system can accurately predict user behavior does not mean it produces shows worth recommending to an active user. For example, “Family Guy” is a very popular show on Hulu, and most users have watched at least some episodes of it. These users do not need us to recommend this show to them — the show is popular enough that users will decide whether or not to watch it by themselves.
Thus, novelty is also an important metric for evaluating recommendations. The first way we can increase novelty is by revising the ItemCF algorithm:
- First, we decrease the weight of popular shows that users have watched before.
- Then, we put more weight on shows that are not only similar to shows the active user has watched before, but also less popular than those shows.
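These two adjustments can be sketched as follows. The 1/log shrinkage of popular watched shows and the popularity penalty on candidates are common choices for this kind of fix, but their exact forms here are assumptions for illustration, as is all the data.

```python
import math

# Sketch of the two novelty adjustments: shrink the influence of popular
# watched shows, and damp candidates more popular than the watched show.
def novelty_score(watchers, prefs, user, i):
    """watchers: {show: set of users}; prefs: {user: {show: r(u, j)}}."""
    score = 0.0
    for j, r in prefs[user].items():
        if j == i or not watchers[i]:
            continue
        sim = len(watchers[i] & watchers[j]) / len(watchers[i])
        # (1) a popular watched show says less about this user's taste
        r_adj = r / math.log(2 + len(watchers[j]))
        # (2) damp candidates more popular than the watched show
        if len(watchers[i]) > len(watchers[j]):
            sim *= len(watchers[j]) / len(watchers[i])
        score += r_adj * sim
    return score

watchers = {
    "Family Guy": {"u1", "u2", "u3"},
    "Hit Show": {"u1", "u2", "u3", "u4"},   # popular candidate
    "Indie Gem": {"u1", "u2"},              # niche candidate
}
prefs = {"u5": {"Family Guy": 1.0}}
# The niche show now outscores the popular one for this user.
print(novelty_score(watchers, prefs, "u5", "Indie Gem") >
      novelty_score(watchers, prefs, "u5", "Hit Show"))  # True
```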
Most users have diverse preferences, so recommendations should also reflect their diverse interests. In our system, we use explanations to diversify our recommendations: we consider a recommendation list diverse when most of the recommended shows have different explanations.
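One way to realize explanation-based diversification is to group candidates by the watched show that explains them and interleave the groups, so no single explanation dominates the list. The sketch below is an assumed implementation with invented data, not Hulu's actual ranking code.

```python
from itertools import zip_longest

# Sketch of explanation-based diversification: round-robin across
# explanation groups so the final list mixes explanations.
def diversify(candidates):
    """candidates: {explanation: [shows ranked best-first]}.
    Interleave one show per explanation per round."""
    result = []
    for round_picks in zip_longest(*candidates.values()):
        result.extend(s for s in round_picks if s is not None)
    return result

candidates = {
    "Because you watched Family Guy": ["American Dad", "Bob's Burgers"],
    "Because you watched The Office": ["Parks and Recreation"],
}
print(diversify(candidates))
# ['American Dad', 'Parks and Recreation', "Bob's Burgers"]
```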
We have performed an A/B test to show the usefulness of diversification (shown in the above figure). The results of the experiment show that, for active users who had previously watched 10 or more shows, diversification can increase recommendation CTR significantly.
A good recommendation system should not generate static recommendations. Users want to see new suggestions every time they visit. If a user has new behaviors, she will find her recommendations have changed, because we put more weight on recent behaviors. But if a user has no new behaviors, we still need to change our recommendations. We use three methods to maintain the temporal diversity of our system:
- First, we’ll recommend recently-added shows to users. Many new shows are added to Hulu every day, and we will suggest these shows to users who will like them. Thus, users will see fresh ideas for shows to watch when new ones are added.
- Second, we will randomize our recommendations. Randomization is the simplest way to keep recommendations fresh.
- Finally, we’ll decrease the rank of recommendations which users have seen many times. This is a form of implicit feedback, and data show that CTR increased by 10% after using this method.
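The third method above can be sketched as a multiplicative demotion per unclicked impression. The decay factor and the data structures below are illustrative assumptions.

```python
# Sketch: demote recommendations the user has already seen many times
# without clicking. The 0.8 decay factor is an illustrative assumption.
def demote_seen(scored, impressions, decay=0.8):
    """scored: {show: score}; impressions: {show: times shown to this user}.
    Multiply each score by decay^impressions so stale items sink."""
    return {s: score * decay ** impressions.get(s, 0)
            for s, score in scored.items()}

scored = {"American Dad": 0.9, "Bob's Burgers": 0.6}
impressions = {"American Dad": 5}   # shown five times, never clicked
ranked = sorted(demote_seen(scored, impressions).items(),
                key=lambda kv: kv[1], reverse=True)
print(ranked[0][0])  # Bob's Burgers now ranks first
```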
Performance of Hulu’s Recommendation Hub
The recommendation hub is a personal recommendation page for every user. On this page users see six carousels. The top carousel is “top recommendations”, which includes shows we think users will like best. After the top recommendations, there are three carousels for three genres, selected by analyzing the user’s historical preferences. The next carousel is bookmarks, which includes shows users have indicated they’d like to watch later. The last carousel is filled with shows the user has already rated; it is designed to collect more explicit feedback from users.
We have performed an A/B test to compare our recommendation algorithms with two simple recommendation algorithms: Most Popular (which recommends the most popular shows to every user) and Highest Rated (which recommends highly-rated shows to every user). As shown in the above figure, experiment results show that the CTR of our algorithm is much higher than that of both simple methods.
Every user behavior can reflect user preferences.
In our system, we use a wide range of user behaviors to generate recommendations. We’ve calculated the CTR of recommendations originating from different types of behaviors. As Figure 3 shows, every type of behavior can generate recommendations that users will click.
Explicit feedback data is more important than implicit feedback data
As shown in Figure 3, the CTR of recommendations that originate from shows users have loved (rated 5 stars) or liked (rated 4 stars) is higher than the CTR of recommendations that come from users’ historical subscribe/watch/search behavior. So although our explicit feedback data is much smaller in size than our implicit feedback data, it is much more important.
Recent behaviors are much more important than old behaviors
Novelty, diversity, and offline accuracy are all important factors
Most researchers focus on improving offline accuracy metrics such as RMSE and precision/recall. However, a recommendation system that can accurately predict user behavior may still not be good enough for practical use. A good recommendation system should consider multiple factors together. In our system, after accounting for novelty and diversity, CTR improved by more than 10%.
Based on the paper “Recommendation System at Hulu” by Liang Xiang, Hua Zheng and Hang Li.
Hua Zheng is the senior lead developer in charge of the Hulu content recommendation and behavior targeting systems.
Dr. Xiang and Dr. Li, associate researchers, are working together on the recommendation system, helping users discover and enjoy relevant premium videos.