Counting word frequencies in a collection of documents is the “Hello World” of Hadoop, with good reason. It is a not-too-contrived task whose underlying structure is a natural fit for distributed computation. In this post we focus on better understanding that underlying structure using some tools from abstract algebra. This approach has useful practical consequences - I was originally motivated to further explore these concepts by a discussion of the theoretical foundations of the Algebird data processing library during an interesting meetup talk on scalable and flexible machine learning.
Code fragments in this post use Scala, but hopefully they are short and simple enough for non-Scala programmers to understand easily.
Loosely speaking, a monoid consists of
For the word count example, we can consider each document to be a single string, and the complete corpus to be the (space-separated) concatenation of all documents.
The output of word count is a mapping from words (strings) to their frequencies in the corpus (integers). These maps naturally form another monoid, where the binary operation is key-wise addition using 0 as the value for missing keys. That is, we combine two maps by summing the values for each word as one would naturally do when combining counts:
It should be fairly clear that the identity element here is simply the empty map. Note that it was straightforward to form this construction in part because the integers are themselves a monoid under addition with identity element 0.
A monoid homomorphism is a structure-preserving function from one monoid to another. To state this more clearly, let’s introduce a small amount of notation.
Let $s \in S$ be a string, with $s_1 + s_2$ representing string concatenation. Then let $m \in M$ be a count map, with $m_1 \oplus m_2$ representing key-wise addition.
We then define our wordCount() function $f: S \rightarrow M$. The monoid homomorphism property is then given by\[f(s_1 + s_2) = f(s_1) \oplus f(s_2)\]
That is, in order for wordCount() to be a monoid homomorphism from strings to count maps, we must get the same result from concatenating the strings and counting their words or counting the words of the individual strings and summing the count maps. It is obvious that a simple token-counting implementation of wordCount() satisfies this property, assuming that our string concatenation does not affect tokenization (eg, we add whitespace between pairs of concatenated strings).
So there you have it - the innumerable “how to count words in Hadoop” tutorials available on the web are in fact well-disguised introductions to monoid homomorphisms. The homomorphism property is the “secret sauce” that ensures a Map-Reduce style computation of word frequencies gives the same result, no matter how we split up the document corpus.
While this all may seem a bit abstract (indeed, that is the point), I would argue that there are practical benefits to thinking about data processing in this way. We now have a nice characterization of the underlying structure that makes a given aggregation well-suited for distributed computation in a Map-Reduce (or similar) framework:
The generality of this perspective can be quite powerful. As mentionedearlier, the Algebird library is built around algebraic abstractions such as monoids. These abstractions can be used to cleanly separate concerns in data processing infrastructure: “plumbing” code can be written against an abstract monoid “interface”.
Another interesting observation is that a variety of probabilistic data structures can be defined as monoids, and in fact Algebird contains implementations for several of them. For example, this could allow you to trade exactness for scalability by (very) easily substituting an approximate version of wordCount() into your data pipeline.
Basic familiarity with the vocabulary of abstract algebra can also be helpful to take full advantage of libraries which leverage these concepts, like Algebird and scalaz.
Finally, I have always found it intrinsically beneficial to try to understand things from multiple angles. The algebraic formulation of distributed data processing provides a slightly different and valuable way of thinking about problems, solutions, and their properties in this domain.
Jimmy Lin recently posted a very relevant and interesting paper about the practical consequences of monoids for the “combiner” processing stage of Map-Reduce : “Monoidify! Monoids as a Design Principle for Efficient MapReduce Algorithms” (arXiv).Written on April 30th, 2013 by David M. Andrzejewski