One of the cool things about being an intern at Hulu is that we are given the freedom to creatively apply our computer science knowledge to everyday problems. Our customer support process generated such a problem: we’ve got a stream of incoming customer support emails, and would like to automatically assign textual tags to them based on their content. This is a machine` learning problem of text classification, and one that we tackle below.
A binary text classification problem is to learn a rule that classifies a concept given a training set. The training set consists of pairs , being a text from some instance space , and being a binary label. We assume there exists a true decision function where . Our goal is to find a hypothesis that best approximates .
For example, if we take our instance space to be the set of all sentences in the English language, and to mean the sentence has positive sentiment, then our problem is to find a classifier that classifies the sentiment of English sentences. This example already illustrates problems we will face in the future — how do we reconcile the fact that the set of all sentences is potentially infinite? How do we represent our sentences in a way that suits learning best? And, most importantly, how do we best learn the true hypothesis from just our training examples?
Hulu gets thousands of customer support emails daily. Each email is represented by a ticket and associated with a set of concise descriptors called tags. With a well-performing tag classifier, the benefits are tangible — we can save our Customer Support Advocates (CSAs) the need to manually tag tickets, and we can route tickets to specific groups of CSAs who specialize in a particular subject area. In general, we would be able to streamline the support experience.
Our problem comes with a unique set of challenges. We would like our service to run on Donki, Hulu’s internal service hosting platform. Since we’d like to limit memory usage, we’d like to keep our approach bounded to a somewhat conservative 512mb. The size of our training set may be very large, and potentially not bounded. With these challenges in mind, we will describe the approach taken, what worked, what didn’t work, and the end result.
Given our wish for the service to run on Donki, Python seems like the natural choice. The NumPy/SciPy/Scikit-Learn stack is an accessible, open source, and comprehensive set of libraries that suffices for our purposes.
As with any Machine Learning problem, feature extraction, feature selection and model selection is crucial. Feature extraction decides which tickets are suitable for training, and extracts the useful parts of suitable tickets. This is the process of canonicalizing a ticket. Feature selection turns the canonical ticket extracted above into a training set. Model selection turns the training set into a classifier.
We would like our canonicalized ticket to be faithful to the eventual prediction context. This means, for example, that the CSA’s response to the ticket should not be included. Fortunately, all changes to a ticket are recorded in chronological order, so we are able to extract only the text and tags relating to the customer’s first contact.
Feature extraction also relies upon heuristics and domain-specific knowledge of the problem. Some additional tickets fields are also of value. We take the operating system field, the subject field, and whether the ticket was created by a Hulu Plus subscriber. For example, tickets from Hulu Plus subscribers are more likely to be about billing. To incorporate these additional fields, we append them to the existing text. For example, if a user is a Hulu Plus subscriber, we append the word “IsPlusSub”.
Sometimes, manually viewing the tickets and the extracted result is the only way of gaining insight. One particular issue was with users replying to emails would bring in a large amount of highly suggestive words. For example, an automated email from Hulu contained a link allowing users to “unsubscribe” to these emails, which was highly suggestive for certain tags. A heuristic was utilized to filter out all parts of the reply parts of the text. Another amusing example was that certain Hulu emails contained the words “Vizio”, “Samsung”, “Sony”, and other TV platforms on which Hulu Plus is available. An pre-existing rule would activate at these keywords, giving these tickets a large number of irrelevant tags.
As we hinted before, our instance space where we draw tickets from may be infinite. One common approach to this problem is the “bag of words” representation. We simplify our instance space from the set of all ticket text to the set of all “suitable” words, which is always finite. The “suitable” criteria is another heuristic, but intuitively, we want to exclude common but meaningless words. For example, the word “I” or “we”, or, in our case, “Hulu”, is meaningless with regards to the tags of a message. Excluding these words assists our model in finding meaningful relationships.
Our revised instance space is often called the vocabulary of our examples. Another optimization we can make is to expand our vocabulary using the technique of “n-grams”, which is a contiguous sequence of words of length n. n-grams can potentially capture relationships between words. For example, a ticket with words “not cancel” and another with “cancel not” would be equivalent in a vocabulary consisting solely of words. However, a 2-gram will capture this difference, since for the first ticket, the vocabulary will contain , while the second will contain .
In order to use the bag of words representation, we need words. This involves pre-processing and tokenizing the canonicalized tickets. We settled on the n-gram range of through experimental testing. This process results in very high dimensional data, since each word and each 2-sequnce of word is a dimension. We attempted dimensionality reduction techniques, to combat the so called “curse of dimensionality”. Dimensionality reduction techniques such as Principal Component Analysis and Latent Semantic Analysis attempt to find linear combinations of data that best correlate. Exploration on this front produced some graphics.
The semantic meaning of an “account_billing” tag and an “account_login_activation” tag is quite similar. Both have to do with a user’s account problems. Hence, when we projected the example down to its two most “information preserving” dimensions, we do not see clear separation. However, “account_billing” and “technology_playback” tags have quite distinct meanings from each other, and we see clear separation after dimensional reduction. Just to confirm our intuition on this, we plot “android” vs “ios”, two tags with distinct meanings, and get the following:
It’s always nice when mathematics are in line with our intuition. While these techniques were not ultimately used due to scalability problems, the ability to visualize and confirm our suspicious regarding separability of tags was highly valuable.
For each document, we want to create a corresponding bag of words representation. The simplest approach is to scan over the set of all documents (also called the corpus), and build a dictionary consisting of the entire vocabulary. If the length of this dictionary is , then a vector of length models a document, where the index represents an element of the vocabulary, and the value at the index is its frequency in the document. One common technique on top of frequency count is the “inverse document frequency” re-weighing, which intuitively says frequent words in the corpus are inherently less meaningful.
The above described technique is often referred to as term frequency-inverse document frequency vectorizing. The technique requires the entire vocabulary in memory, rendering it stateful. So we explored alternatives for scaling purposes. We settled upon using what is called the “hashing trick”. Very simply, instead of building a vocabulary by scanning the corpus once ahead of time, we explicitly pre-define the output space to be a high dimensional vector space. Then for each word and n gram, we hash and increment that index of the output vector. Because collisions are unlikely in high dimensional space, the hashed vector is a faithful representation of the original text. The one thing we gain in this case is statelessness, so we can use this technique on batches of documents. A brief survey of papers on this subject (here, and here) shows the hashing trick has both theoretical and empirical support.
In order to select a model, we need a way to calculate performance and a dataset to calculate performance on. Being a binary classification problem, performance is defined as counts of true positives, true negatives, false positives, and false negatives. How we view these numbers is interesting and worth discussing. In our case, two relevant metrics are “precision” and “recall”. Precision measures the ratio whereas recall measures the ratio . Achieving a high precision means that when we do predict the positive label, we are mostly correct. High recall means we are correctly predicting most of the positive labeled example positive. High precision usually comes at the cost of low recall, and vice versa. For example, if a model does not predict positive on any example except the very few that it’s very sure of, it achieves high precision, but low recall. If a model predict positive on every label, it achieves high recall but low precision.
Because our tool will be used as a pre-processing step in the handling of a ticket, we decided that high precision is more valuable than high recall. We would rather not have the classifier make a positive prediction, unless it is reasonably sure that this prediction is correct.
To get a dataset for testing, we employ cross validation and use a test set. In cross validation, we split the training set into folds, and for , we train on all folds other than the fold, and do scoring on the fold. We make a test set by holding back a ratio of the training set from training, and to calculate performance on that set after the model is trained. Both methods ensure the set we are scoring on does not participate in the training of the model.
The overall metric we would like to optimize is the score, which is the harmonic mean of precision and recall. It can be calculated as . We survey a broad variety of algorithms for classification, including: Naive Bayes, K-Nearest Neighbor, Support Vector Machine, Stochastic Gradient Decent, Decision Tree, and Random Forests, and AdaBoost. Unfortunately, the Decision Tree, Random Forests, and AdaBoost classifiers cannot handle sparse data, so they were immediately disqualified.
The scores, training, and test times of our different classifier is listed below for a collection of 15,000 training and 6,500 test documents.
|Stochastic Gradient Decent
|Bernoulli Naïve Bayes
|Multinomial Naïve Bayes
In the end, the Stochastic Gradient Decent classifier was chosen. It has the desirable combination of speed and performance, and can be tweaked to optimize for a variety of objective functions. Furthermore, it is an online algorithm, meaning the training process can be split up in batches, alleviating memory concerns.
Odds and Ends
There are many tags that are simply unpredictable. Text classification relies upon the label having a clear and distinguishable meaning. For example, we found tags that indicate an action such as giving a customer a credit performed very poorly, since it is hard to gauge from the contents of the text whether a credit will be given. As such, we’ve implemented an automated pruning system into our classifier. We score each tag’s classifier at the end of training on the test set, and if that tag’s performance does not satisfy some minimum, we do not ever predict true for that tag.
Secondly, we may want to partition tags into classes, where at most one tag can be predicted per class. For example, we may wish to avoid the many device tags problem described above and enforce a maximum of 1 device tag per ticket. To solve this problem, we appeal to geometry. Any linear classifier is a hyperplane in the feature space, and can be represented as a weight vector, . Predictions for an incoming vector is made via , since any positive number means this example lies on the positive label side of the hyperplane. However, if we have a tag class , and a set of classifiers , then we can output a prediction , provided . This is commonly referred to as the one-vs-all method of extending binary classifiers to predict on a tag class .
Applying machine learning techniques to real world data is tricky, exciting, and rewarding. Here, we’ve outlined some of the decisions we’ve made, approaches that didn’t work, ones that did, and how the steps of feature extraction, feature selection, and model selection were performed. In the future, we may expand into classifying phone transcriptions, and personalize the classifier for individual CSAs, to further optimize the automation within our customer support platform.
Chris Liu is a software developer intern on our customer platform team.