A machine learning research paper tends to present a newly proposed method or algorithm in relative isolation. Problem context, data preparation, and feature engineering are ideally discussed to the extent required for reader understanding and scientific reproducibility, but they are usually not the primary focus. Given the goals and constraints of the format, this can be seen as a reasonable trade-off: the authors opt to spend scarce “ink” on only the most essential (often abstract) ideas.
As a consequence, implementation details relevant to using the proposed technique in an actual production system often go entirely unmentioned. This aspect of machine learning is instead left as “folk wisdom” to be picked up from colleagues, blog posts, discussion boards, snarky tweets, open-source libraries, or, more often than not, first-hand experience.
Papers from conference “industry tracks” often deviate from this template, yielding valuable insights about what it takes to make machine learning effective in practice. This paper from Google on detecting “malicious” (i.e., scam/spam) advertisements won the best industry paper award at KDD 2011 and is a particularly interesting example.
Detecting Adversarial Advertisements in the Wild
D. Sculley, Matthew Otey, Michael Pohl, Bridget Spitznagel, John Hainsworth, Yunkai Zhou
http://research.google.com/pubs/archive/37195.pdf
At first glance, this might appear to be a “Hello-World” machine learning problem straight out of a textbook or tutorial: we simply train a Naive Bayes classifier on a set of bad ads versus a set of good ones, as in the sketch below. However, this is apparently far from the case: while Google is understandably shy about hard numbers, the paper mentions several issues that make the problem especially challenging and notes that it is business-critical for Google.
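For flavor, here is what that naive textbook baseline might look like in Python with scikit-learn. The example ads and labels are invented for illustration and bear no resemblance to Google's actual pipeline:

```python
# Minimal sketch of the "textbook" baseline: a Naive Bayes classifier
# trained on toy ad text. All example ads here are invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

ads = [
    "cheap designer watches free shipping act now",   # bad
    "miracle weight loss pill doctors hate",          # bad
    "quality running shoes from a trusted retailer",  # good
    "learn piano with certified instructors",         # good
]
labels = [1, 1, 0, 0]  # 1 = bad ad, 0 = good ad

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(ads)
model = MultinomialNB().fit(X, labels)

print(model.predict(vectorizer.transform(["free miracle watches"])))
```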
The paper describes an impressive and pragmatic blend of different techniques and tricks. I’ve briefly described some of the highlights, but I would certainly encourage the interested reader to check out the original paper and presentation slides.
The core machine learning technique is (unsurprisingly) classification: is this ad OK to show to the user or not? Code for some of the core machine learning algorithms involved is available.
Like the winning submission to the Netflix Prize, Microsoft Kinect, and IBM Watson, the proposed system uses an ensemble approach where the outputs of many different models are combined to yield a final prediction. This technique constitutes the closest thing to a “free lunch” in modern machine learning; if raw prediction accuracy is the goal, the use of ensembles should at least be considered.
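A minimal sketch of the idea, assuming sklearn-style base models and a simple unweighted probability average; the paper's actual base learners and combination scheme are more sophisticated:

```python
# Hypothetical ensemble sketch: average the predicted probabilities of
# several heterogeneous base classifiers. Base models chosen for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
models = [LogisticRegression(), GaussianNB(), DecisionTreeClassifier(max_depth=4)]
for m in models:
    m.fit(X, y)

# Unweighted average of P(bad ad); a learned combiner is another common choice.
ensemble_prob = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
prediction = (ensemble_prob > 0.5).astype(int)
```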
While there is additional work needed to properly calibrate and quantify prediction uncertainty, in this application it is worthwhile to enable the automated system to simply say “I don’t know” when appropriate and escalate the decision to a human.
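A toy sketch of the escalation idea; the `decide` function and its thresholds are illustrative assumptions, since the paper does not spell out its decision rule:

```python
# Sketch of an "I don't know" band: if the model's probability of badness
# falls in an ambiguous middle range, escalate to a human reviewer.
def decide(prob_bad, lower=0.2, upper=0.8):
    if prob_bad >= upper:
        return "block"
    if prob_bad <= lower:
        return "allow"
    return "escalate to human expert"

for p in (0.05, 0.5, 0.95):
    print(p, "->", decide(p))
```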
Feature representation is a crucial machine learning design decision. They cast a very wide net in representing an ad, including the words and topics used in the ad, links to and from the ad's landing page, information about the advertiser, and more. Ultimately they rely on strong L1 regularization to enforce sparsity and uncover a limited number of truly relevant features.
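A small demonstration of L1-induced sparsity on synthetic data, using scikit-learn's liblinear-backed logistic regression; the regularization strength `C=0.1` is an arbitrary choice for illustration:

```python
# Sketch of L1 regularization driving most feature weights to exactly zero,
# leaving a sparse set of relevant features. Data here is synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=200,
                           n_informative=10, random_state=0)
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("nonzero weights:", np.sum(clf.coef_ != 0), "of", clf.coef_.size)
```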
Feature hashing (the “hashing trick”) is a very practical technique for handling high-dimensional feature spaces by hashing features down to a lower-dimensional space - this answer on the MetaOptimize discussion board gives a nice explanation with references.
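A minimal sketch of the trick itself; the bucket count and the use of MD5 as a stable hash are my own choices for illustration:

```python
# Sketch of the hashing trick: map arbitrary string features into a fixed
# number of buckets, bounding the dimensionality of the weight vector no
# matter how large the raw vocabulary grows.
import hashlib

N_BUCKETS = 2 ** 20  # chosen up front; collisions are tolerated

def bucket(token, n_buckets=N_BUCKETS):
    # stable hash (Python's built-in hash() is salted per process)
    return int(hashlib.md5(token.encode("utf8")).hexdigest(), 16) % n_buckets

def hashed_features(tokens):
    x = {}
    for tok in tokens:
        idx = bucket(tok)
        x[idx] = x.get(idx, 0.0) + 1.0  # colliding features simply add up
    return x  # sparse {bucket index: count} representation

print(hashed_features(["free", "watches", "free"]))
```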
Highly imbalanced classes (here there are far more legitimate ads than scam ones) are a classic supervised classification “gotcha”. There are different ways to handle this, but here they achieve improved performance by transforming the problem to ranking: all malicious ads should be “ranked higher” than legitimate ones.
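Here is a toy sketch of one way to implement that pairwise-ranking idea: SGD with a hinge loss on sampled (bad, good) pairs. The synthetic data, learning rate, and margin are stand-ins, not the paper's setup:

```python
# Sketch of recasting imbalanced classification as ranking: learn a linear
# score so that bad ads outrank good ads, via SGD on random (bad, good) pairs.
import numpy as np

rng = np.random.default_rng(0)
bad = rng.normal(1.0, 1.0, size=(20, 5))     # tiny synthetic minority class
good = rng.normal(0.0, 1.0, size=(2000, 5))  # large synthetic majority class

w, lr = np.zeros(5), 0.1
for _ in range(10000):
    xb = bad[rng.integers(len(bad))]
    xg = good[rng.integers(len(good))]
    if w @ xb - w @ xg < 1.0:        # hinge: want score(bad) > score(good) + margin
        w += lr * (xb - xg)
```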
Besides class imbalance, the task is also complicated by the existence of different “kinds” of bad ads (ads that direct you to malware, ads for counterfeit goods, etc). They simultaneously address both issues by using two-stage classification. First, is the ad Good or Bad? Second, if the ad is Bad, is it BadTypeA or not, is it BadTypeB or not, and so on.
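A structural sketch of such a cascade, assuming pre-trained sklearn-style models and invented category names:

```python
# Sketch of the two-stage cascade: a top-level Good-vs-Bad model, then a
# one-vs-rest detector per bad category. Names here are illustrative.
class TwoStageClassifier:
    def __init__(self, is_bad_model, subtype_models):
        self.is_bad_model = is_bad_model      # stage 1: Bad vs Good
        self.subtype_models = subtype_models  # stage 2: {"malware": model, ...}

    def predict(self, x):
        if not self.is_bad_model.predict([x])[0]:
            return ["Good"]
        # Second stage runs only on ads the first stage flags as Bad.
        hits = [name for name, m in self.subtype_models.items()
                if m.predict([x])[0]]
        return hits or ["Bad:unknown-type"]
```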
Unlike experimental software written primarily for a research paper, a production machine learning system exists within an engineering and business context. This puts increased importance on scalability, validation, reliability, and maintainability.
Somewhat surprisingly, they find the scalability bottleneck to be loading examples from disk and extracting features. Therefore, they perform that work in parallel Map jobs and use a single Reduce job for Stochastic Gradient Descent (SGD) classifier training.
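A toy sketch of that pipeline shape, with Python's `multiprocessing` standing in for MapReduce and a single made-up feature; the real system's features and infrastructure are of course far richer:

```python
# Sketch: parallel "map" workers do the expensive feature extraction,
# while a single sequential "reduce" performs SGD training.
import math
from multiprocessing import Pool

def extract_features(raw_ad):
    x = float(raw_ad.count("free"))   # toy single feature
    y = 1 if "free" in raw_ad else 0  # toy label
    return x, y

def train():
    raw_ads = ["free watches", "running shoes", "free pills"] * 100
    with Pool() as pool:                      # parallel "map" phase
        examples = pool.map(extract_features, raw_ads)
    w, lr = 0.0, 0.1
    for x, y in examples:                     # single "reduce": sequential SGD
        pred = 1.0 / (1.0 + math.exp(-w * x))
        w += lr * (y - pred) * x              # gradient step on log loss
    return w

if __name__ == "__main__":
    print("learned weight:", train())
```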
To make sure the system “still works” as its inputs evolve over time, they do extensive monitoring of key quantities and investigate further if large changes are observed.
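One hypothetical form such monitoring could take: track a key statistic over a sliding window and alert on large deviations from a baseline. The monitored quantity and tolerance below are assumptions for illustration:

```python
# Sketch of distribution monitoring: watch the fraction of ads flagged Bad
# and alert when it drifts far from the rate observed at baseline.
from collections import deque

class DriftMonitor:
    def __init__(self, window=1000, tolerance=0.05):
        self.recent = deque(maxlen=window)
        self.baseline = None
        self.tolerance = tolerance

    def observe(self, flagged_bad):
        self.recent.append(flagged_bad)
        rate = sum(self.recent) / len(self.recent)
        if self.baseline is None and len(self.recent) == self.recent.maxlen:
            self.baseline = rate  # freeze a baseline once the window fills
        elif self.baseline is not None and abs(rate - self.baseline) > self.tolerance:
            print("ALERT: flagged rate %.3f vs baseline %.3f" % (rate, self.baseline))
```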
In machine learning papers, a predictive model is often boiled down to its mathematical essence - nothing more than a vector of learned weights. However, in software engineering practice the authors find it useful to extend the “model object” to include feature transformation, probability calibration, training hyperparameters, as well as other information.
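A sketch of what such a “fat” model object might look like; the field names are my own guesses rather than the paper's actual schema:

```python
# Sketch: bundle the learned weights with everything needed to reproduce
# and serve predictions, so the whole artifact is versioned as one unit.
from dataclasses import dataclass, field

@dataclass
class ModelArtifact:
    weights: list                     # the "mathematical essence"
    feature_transform: object         # e.g. a fitted hashing vectorizer
    calibrator: object                # maps raw scores to probabilities
    hyperparameters: dict = field(default_factory=dict)
    training_metadata: dict = field(default_factory=dict)  # data snapshot, date, code version
```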
The business importance and general trickiness of the problem necessitates the use of human experts as an integral part of the overall solution.
They use active learning-esque strategies to identify the highest “value” examples for human experts to manually label (e.g., ambiguous or particularly difficult cases). They also provide an information retrieval-based user interface to help experts search for newly emerging threats.
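A minimal sketch of uncertainty sampling, one of the simplest active-learning selection strategies (the paper's actual selection criteria may well differ):

```python
# Sketch of uncertainty sampling: send the ads whose predicted probability
# is closest to 0.5 (where the model is least sure) to human experts.
import numpy as np

def select_for_review(probs_bad, budget=10):
    uncertainty = -np.abs(np.asarray(probs_bad) - 0.5)  # higher = more ambiguous
    return np.argsort(uncertainty)[-budget:]            # indices of hardest cases

print(select_for_review([0.01, 0.49, 0.97, 0.55, 0.2], budget=2))
```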
Sometimes “the human knows best” - they are not dogmatic about doing absolutely everything with fully automated machine learning methods. Instead, they allow the experts to craft simple hard-coded rules when appropriate.
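A toy sketch of how hand-coded expert rules might short-circuit the learned model; the rules, field names, and threshold below are all invented for illustration:

```python
# Sketch: check a small list of expert-written predicates before falling
# back to the learned model's probability.
RULES = [
    ("known_bad_domain", lambda ad: ad.get("landing_domain") in {"evil-example.test"}),
    ("blocked_phrase",   lambda ad: "miracle cure" in ad.get("text", "")),
]

def classify(ad, model_prob_bad):
    for name, rule in RULES:
        if rule(ad):
            return "block (rule: %s)" % name
    return "block" if model_prob_bad > 0.8 else "allow"

print(classify({"landing_domain": "evil-example.test", "text": ""}, 0.1))
```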
Even expert judgments cannot be interpreted as the Fixed and Absolute Truth. Expert-provided labels may differ due to human error, varying interpretations of label categories, or simple differences of opinion. To adjust for this uncertainty, they use multiple expert judgments on the same ads to calibrate confidences in both annotations and annotators. There is a sizable body of research on how to do this (e.g., for Amazon Mechanical Turk tasks). See the LingPipe blog for more references and discussion.
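As a crude sketch of the idea (far simpler than the full Dawid-Skene-style models in that literature), one can weight each expert's vote by an estimated reliability:

```python
# Sketch: resolve disagreeing expert labels by a reliability-weighted vote.
# Reliability estimates would themselves be learned from overlapping labels.
def weighted_label(votes, reliability):
    # votes: {annotator: 0/1 label}; reliability: {annotator: accuracy estimate}
    score = sum(reliability[a] * (1 if v else -1) for a, v in votes.items())
    return int(score > 0)

print(weighted_label({"alice": 1, "bob": 0, "carol": 1},
                     {"alice": 0.9, "bob": 0.6, "carol": 0.7}))
```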
Finally, they also periodically use non-expert evaluations to make sure the system is working well in the eyes of regular users. Since end-user satisfaction is (presumably!) the ultimate objective, actually measuring this is probably a very good idea.
Written on July 26th, 2012 by David M. Andrzejewski