Latest Posts

Word count is a monoid homomorphism 2013-04-30

Why does Map-Reduce work?

Counting word frequencies in a collection of documents is the "Hello World" of Hadoop, with good reason. It is a not-too-contrived task whose underlying structure is a natural fit for distributed computation. In this post we focus on better understanding that underlying structure using some tools from abstract algebra. This approach has useful practical consequences - I was originally motivated to further explore these concepts by a discussion of the theoretical foundations of the Algebird data processing library during an interesting meetup talk on scalable and flexible machine learning.

Code fragments in this post use Scala, but hopefully they are short and simple enough for non-Scala programmers to understand easily.

String monoid under concatenation

Loosely speaking, a monoid consists of

  • a collection of objects (eg, strings)
  • a binary operation for combining these objects to yield a new object of the same type (eg, string concatenation)
  • an identity object whose combination leaves an object unchanged (eg, the empty string)
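
As a minimal sketch in Scala (the Monoid trait below is illustrative; libraries such as Algebird provide an equivalent abstraction):

// A monoid: a type A, an associative binary operation, and an identity
trait Monoid[A] {
  def combine(x: A, y: A): A  // the binary operation
  def empty: A                // the identity object
}

// Strings form a monoid under concatenation, with "" as the identity
val stringConcat: Monoid[String] = new Monoid[String] {
  def combine(x: String, y: String): String = x + y
  def empty: String = ""
}

// Combining with the identity leaves a string unchanged
assert(stringConcat.combine("map", stringConcat.empty) == "map")
// Concatenation is associative: either grouping gives the same result
assert(stringConcat.combine(stringConcat.combine("map", "re"), "duce")
    == stringConcat.combine("map", stringConcat.combine("re", "duce")))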

Charles Babbage: Big Data visionary 2012-09-14

On two occasions I have been asked,—"Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" In one case a member of the Upper, and in the other a member of the Lower, House put this question. I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.

Charles Babbage (source: Wikiquote)

Notes from BayLearn 2012 2012-09-08

One of the benefits of working in the San Francisco Bay Area is access to tons of interesting tech talks and other related events. Last week I attended the Bay Area Machine Learning Symposium, which was graciously hosted by Google. Below are some brief notes on some of the talks (any errors or misunderstandings are of course mine). From my perspective, an interesting theme that emerged was the "asynchronous update" design strategy for large-scale distributed machine learning.

Keynotes

The keynote talks were excellent, and represented very different branches of work in modern machine learning.

Convex Optimization: From embedded real-time to large-scale distributed (Stephen Boyd, Stanford)

Big idea: Convex optimization is a general and powerful tool: many real-world applications admit formulation as convex optimization problems, and the theory/practice/technology for solving these is quite well-developed.

Current research: This talk suggested a loose categorization for thinking about the scale of different convex optimization problems.

TEST POST - embedding code with gists and prettify 2012-08-21

It is also useful to have a simple way to display nicely formatted code. Our example snippet is a Clojure function for constructing a machine learning feature encoder, and we display it in two ways:

  • GitHub gist embedding
  • Google Prettify

;; Assumes examples are Clojure maps, 
;; for example from clojure.contrib.sql
;;
;; {:eyecolor "brown" :heightmeters 1.7 ...}
;; {:eyecolor "blue" :heightmeters 1.5 ...}
;;
;; (def eye-encoder (discrete-feature :eyecolor myexamples))
;;
(defn discrete-feature
  "Create binary 1-of-N encoding of a discrete feature"
  [featurekey allexamples]
  (let [uniqvals (->> allexamples (map featurekey) set vec)]
    {:values (map #(format "%s:%s" (str featurekey) %1) uniqvals)
     :encoder (fn [example]
                (map #(if (= %1 (featurekey example)) 1 0) uniqvals))}))

TEST POST - LaTeX equations with MathJax 2012-08-19

It is very useful to have nicely formatted equations in blog posts. The test example below uses the MathJax JavaScript library, with a ruhoh configuration based on a GitHub comment from Ramnath Vaidyanathan.

Loss functions and empirical risk

Given a dataset of observed input-output pairs $(x_1, y_1), \ldots, (x_N, y_N)$, we can estimate a function $f(x)$ to predict future values of $y$ given $x$. We measure the difference between the true $y$ and our prediction $f(x)$ with a loss function, such as the squared loss $\ell(y, f(x)) = (y-f(x))^2$. We can then compute the empirical risk $R$ of our function $f$ with respect to loss function $\ell$ over a dataset as: $$ R(\ell,f) = \frac{1}{N} \sum_{i=1}^{N} \ell(y_i, f(x_i)) $$ which for the squared loss above becomes $\frac{1}{N} \sum_{i=1}^{N} (y_i - f(x_i))^2$.
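
For concreteness, here is a minimal Scala sketch of the same computation (a hypothetical illustration; the names empiricalRisk and squaredLoss are not from the original post):

// Empirical risk: average loss of predictor f over observed (x, y) pairs
def empiricalRisk(
    data: Seq[(Double, Double)],
    loss: (Double, Double) => Double,
    f: Double => Double): Double =
  data.map { case (x, y) => loss(y, f(x)) }.sum / data.length

// Squared loss: (y - f(x))^2
val squaredLoss = (y: Double, fx: Double) => math.pow(y - fx, 2)

// Example: risk of the predictor f(x) = 2x on a tiny dataset
val data = Seq((1.0, 2.0), (2.0, 3.9), (3.0, 6.1))
val r = empiricalRisk(data, squaredLoss, x => 2 * x)  // ~ 0.0067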

Practical machine learning tricks from the KDD 2011 best industry paper 2012-07-26

A machine learning research paper tends to present a newly proposed method or algorithm in relative isolation. Problem context, data preparation, and feature engineering are hopefully discussed to the extent required for reader understanding and scientific reproducibility, but are usually not the primary focus. Given the goals and constraints of the format, this can be seen as a reasonable trade-off: the authors opt to spend scarce "ink" on only the most essential (often abstract) ideas.

As a consequence, implementation details relevant to the use of the proposed technique in an actual production system are often not mentioned at all. This aspect of machine learning is instead left as "folk wisdom" to be picked up from colleagues, blog posts, discussion boards, snarky tweets, open-source libraries, or more often than not, first-hand experience.

Papers from conference "industry tracks" often deviate from this template, yielding valuable insights about what it takes to make machine learning effective in practice. This paper from Google on detecting "malicious" (ie, scam/spam) advertisements won best industry paper at KDD 2011 and is a particularly interesting example.

Static blogging with ruhoh 2012-07-13

This is an early experiment using ruhoh for blogging. Ruhoh is a static blog generation tool: given properly formatted input content, ruhoh simply generates the fixed code for a blog-style website. This generated code is the website - you can just serve it up, for example from Amazon S3. This style of blog has somewhat recently become a flourishing technological niche; probably the most famous specimen is Jekyll, which is used to drive GitHub Pages. I'll briefly describe the appeal of this approach for my needs.

Plaintext

The blog content and styling consists of nothing more than a specially structured directory of plaintext files. This means you can easily write posts in the text editor of your choice, version control everything, and publish from the command-line. Pretty sweet, if you're into that sort of thing.

Simplicity

By definition, a static site has fewer "moving parts" than a fully-loaded dynamic one. This sacrifices some flexibility in terms of commenting, plugins, and other nifty dynamic content. However, if you simply want to write things down and put them on The Internet, the loss of these features can be outweighed by the gain of not having to deal with their complexity.