## Latest Posts

### Why does Map-Reduce work?

Counting word frequencies in a collection of documents is the
"Hello World" of Hadoop,
with good reason. It is a not-too-contrived task whose **underlying
structure** is a natural fit for distributed computation. In this
post we focus on better understanding that underlying structure using
some tools from abstract algebra. This approach has useful practical
consequences: I was originally motivated to explore these concepts
further by a discussion of the theoretical foundations of the Algebird
data processing library during an interesting meetup talk on scalable
and flexible machine learning.

Code fragments in this post use Scala, but hopefully they are short
and simple enough for non-Scala programmers to understand easily.

### String monoid under concatenation

Loosely speaking, a monoid consists of

- a *collection of objects* (eg, strings)
- a *binary operation* for combining these objects to yield a new object of the same type (eg, string concatenation)
- an *identity object* whose combination leaves an object unchanged (eg, the empty string)
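
As a rough sketch of my own (not necessarily the post's exact code),
the string monoid might be written in Scala like this:

```scala
// A minimal monoid: a type of values, a binary combine operation, and an identity.
trait Monoid[T] {
  def combine(a: T, b: T): T // binary operation
  def identity: T            // identity element: combine(a, identity) == a
}

// Strings form a monoid under concatenation, with the empty string as identity.
object StringMonoid extends Monoid[String] {
  def combine(a: String, b: String): String = a + b
  def identity: String = ""
}

// StringMonoid.combine("map", "reduce")              == "mapreduce"
// StringMonoid.combine("map", StringMonoid.identity) == "map"
```

Because concatenation is associative, a list of strings can be combined
in any grouping, which is exactly the property that lets the work be
split across machines.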

*On two occasions I have been asked,—"Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" In one case a member of the Upper, and in the other a member of the Lower, House put this question. I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.*

Charles Babbage (source: Wikiquote)

One of the benefits of
working in the San Francisco Bay Area
is access to tons of interesting tech talks and other related
events. Last week I attended the
Bay Area Machine Learning Symposium, which
was graciously hosted by Google. Below are brief notes on some of the
talks (any errors or misunderstandings are of course mine).
From my perspective, an interesting theme that emerged was the
"asynchronous update" design strategy for large-scale distributed
machine learning.

## Keynotes

The keynote talks were excellent,
and represented very different branches of work in modern machine
learning.

### Convex Optimization: From embedded real-time to large-scale distributed (Stephen Boyd, Stanford)

**Big idea:** Convex optimization is a general and powerful tool: many
real-world applications admit formulation as convex optimization
problems, and the theory/practice/technology for solving these is
quite well-developed.
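
For reference (my own gloss, not part of the talk notes), a convex
optimization problem in standard form, omitting equality constraints,
looks like

$$
\begin{aligned}
\text{minimize} \quad & f_0(x) \\
\text{subject to} \quad & f_i(x) \le 0, \quad i = 1, \ldots, m
\end{aligned}
$$

where $x \in \mathbf{R}^n$ is the decision variable and $f_0, \ldots, f_m$
are convex; least squares and linear programming are familiar special cases.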

**Current research:** This talk suggested a loose categorization for
thinking about the scale of different convex optimization problems:

It is also useful to have a simple way to display nicely formatted
code. Our example snippet is a Clojure function for constructing a
machine learning feature encoder, and we display it in two ways below.
### GitHub gist embedding

### Google Prettify

    ;; Assumes examples are Clojure maps,
    ;; for example from clojure.contrib.sql
    ;;
    ;; {:eyecolor "brown" :heightmeters 1.7 ...}
    ;; {:eyecolor "blue" :heightmeters 1.5 ...}
    ;;
    ;; (def eye-encoder (discrete-feature-encoder :eyecolor myexamples))
    ;;
    (defn discrete-feature-encoder
      "Create binary 1-of-N encoding of a discrete feature"
      [featurekey allexamples]
      (let [uniqvals (->> allexamples (map featurekey) set vec)]
        {:values  (map #(format "%s:%s" (str featurekey) %1) uniqvals)
         :encoder (fn [example]
                    (map #(if (= %1 (featurekey example)) 1 0) uniqvals))}))

It is very useful to have nicely formatted equations in blog posts.
The test example below uses the MathJax
JavaScript library, with a ruhoh configuration based on a
GitHub comment
from Ramnath Vaidyanathan.

### Loss functions and empirical risk

Given a dataset of observed input-output pairs $(x_1, y_1), \ldots,
(x_N, y_N)$, we can estimate a function $f(x)$ to predict future
values of $y$ given $x$. We measure the difference between the true
$y$ and our prediction $f(x)$ with a loss function, such as $\ell(y,
f(x)) = (y-f(x))^2$. We can then compute the empirical risk $R$ of
our function $f$ with respect to loss function $\ell$ over a dataset as:
$$ R(\ell,f) = \frac{1}{N} \sum_{i=1}^{N} \ell(y_i, f(x_i)) = \frac{1}{N} \sum_{i=1}^{N} (y_i - f(x_i))^2 $$
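
As a small illustration (my own sketch, not from the original post),
the empirical risk under squared loss could be computed in Scala as:

```scala
// Squared loss between a true value y and a prediction f(x).
def squaredLoss(y: Double, prediction: Double): Double =
  (y - prediction) * (y - prediction)

// Empirical risk: average loss of predictor f over observed (x, y) pairs.
def empiricalRisk(data: Seq[(Double, Double)], f: Double => Double): Double =
  data.map { case (x, y) => squaredLoss(y, f(x)) }.sum / data.length

// Example: the constant predictor f(x) = 0 on the points (1, 2) and (2, 3)
// has empirical risk (2^2 + 3^2) / 2 = 6.5:
// empiricalRisk(Seq((1.0, 2.0), (2.0, 3.0)), _ => 0.0)
```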

A machine learning research paper tends to present a newly proposed
method or algorithm in relative isolation. Problem context, data
preparation, and feature engineering are hopefully discussed to the
extent required for reader understanding and scientific
reproducibility, but are usually not the primary focus. Given the
goals and constraints of the format, this can be seen as a reasonable
trade-off: the authors opt to spend scarce "ink" on only the most
essential (often abstract) ideas.

As a consequence, implementation details relevant to the use of the
proposed technique in an actual production system are often not
mentioned at all. This aspect of machine learning is often left as
"folk wisdom" to be picked up from colleagues, blog posts, discussion
boards, snarky tweets, open-source libraries, or, more often than not,
first-hand experience.

Papers from conference "industry tracks" often deviate from this
template, yielding valuable insights about what it takes to make
machine learning effective in practice. This paper from Google on
detecting "malicious" (ie, scam/spam) advertisements won
best industry paper
at KDD 2011 and is a particularly
interesting example.

This is an early experiment using ruhoh for
blogging. Ruhoh is a **static** blog generation tool: given properly
formatted input content, ruhoh simply **generates** the fixed code for a
blog-style website. This generated code **is** the website: you can
just serve it up, for example from Amazon S3. This style of blog has
somewhat recently become a flourishing technological niche; probably
the most famous specimen is Jekyll, which is used to drive GitHub
Pages. I'll briefly describe the appeal of this approach for my needs.

### Plaintext

The blog content and styling consist of nothing more than a specially
structured directory of plaintext files. This means you can easily
write posts in the text editor of your choice, version control
everything, and publish from the command-line. Pretty sweet, if
you're into that sort of thing.

### Simplicity

By definition, a static site has fewer "moving parts" than a
fully-loaded dynamic one. This sacrifices some flexibility in terms
of commenting, plugins, and other nifty dynamic content. However, if
you simply want to write things down and put them on The Internet, the
loss of these features can be outweighed by the gain of not having to
deal with their complexity.