One of the benefits of working in the San Francisco Bay Area is access to tons of interesting tech talks and related events. Last week I attended the Bay Area Machine Learning Symposium, which was graciously hosted by Google. Below are brief notes on a few of the talks (any errors or misunderstandings are of course mine). From my perspective, an interesting theme that emerged was the “asynchronous update” design strategy for large-scale distributed machine learning.
The keynote talks were excellent, and represented very different branches of work in modern machine learning.
Big idea: Convex optimization is a general and powerful tool: many real-world applications admit formulation as convex optimization problems, and the theory/practice/technology for solving these is quite well-developed.
Current research: The talk suggested a loose categorization of convex optimization problems by scale, ranging roughly from small problems that must be solved very quickly (e.g., in embedded or real-time settings) up to problems too large to handle on a single machine.
Research directions included adapting to the “high” and “low” ends of this scale via distributed algorithms and code generation tools, respectively.
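To make the “formulate it as a convex problem and hand it to a solver” idea concrete, here is a minimal sketch using CVXPY, a Python modeling tool from Boyd’s group. The specific problem (non-negative least squares on random data) is just my own illustrative choice, not something from the talk:

```python
# Minimal sketch: express a convex problem (non-negative least squares)
# in a modeling language and hand it off to a generic solver.
import numpy as np
import cvxpy as cp

np.random.seed(0)
A = np.random.randn(30, 10)   # toy data
b = np.random.randn(30)

x = cp.Variable(10)
objective = cp.Minimize(cp.sum_squares(A @ x - b))  # convex objective
constraints = [x >= 0]                              # convex feasible set
problem = cp.Problem(objective, constraints)
problem.solve()

print("optimal value:", problem.value)
print("solution:", x.value)
```

The appeal is exactly what the talk emphasized: once the problem is written in this standard form, the solver technology is mature enough that you rarely have to think about the optimization algorithm itself.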
Other resources: Prof. Boyd is the co-author (along with Lieven Vandenberghe) of the definitive textbook Convex Optimization, which is available as a free PDF download.
Convex optimization also plays a key role in other research areas such as Compressive Sensing (CS), which allows you to “beat” the Nyquist-Shannon sampling theorem, and low-rank matrix completion, which is one way to mathematically formulate the well-known “Netflix Challenge”. The compressive sensing repository maintained by Rice University and the Nuit Blanche blog are great resources for learning more about these fields.
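As a toy illustration of the matrix completion formulation (again my own example, not one from the talk): treat the ratings matrix as low rank, and minimize the nuclear norm, a convex surrogate for rank, subject to matching the entries you actually observed. A hedged CVXPY sketch on synthetic data:

```python
# Toy low-rank matrix completion via nuclear norm minimization (synthetic data).
import numpy as np
import cvxpy as cp

np.random.seed(1)
m, n, r = 20, 20, 2
M = np.random.randn(m, r) @ np.random.randn(r, n)   # true low-rank matrix
mask = (np.random.rand(m, n) < 0.5).astype(float)   # observe ~half the entries

X = cp.Variable((m, n))
objective = cp.Minimize(cp.normNuc(X))              # nuclear norm: convex proxy for rank
constraints = [cp.multiply(mask, X) == mask * M]    # agree with the observed entries
cp.Problem(objective, constraints).solve()

print("relative recovery error:", np.linalg.norm(X.value - M) / np.linalg.norm(M))
```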
Big idea: For many applications of machine learning, the main bottleneck is the human expertise and effort required to craft suitable features for use as inputs to the algorithm. Deep learning techniques aim to automatically learn useful representations from the data. For example, deep learning algorithms applied to images often “discover” a “vocabulary” of edge detectors and simple shapes. The use of these building blocks (versus “raw” pixel data) can then lead to improved performance on vision tasks like object recognition.
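To make “learning a representation instead of hand-crafting features” a bit more concrete, here is a tiny single-hidden-layer autoencoder in NumPy. It is only a toy (the real systems are far deeper and larger, and this is not the architecture from any paper mentioned here), but it shows the basic idea of learning hidden features that reconstruct the raw input:

```python
# Toy single-hidden-layer autoencoder trained with plain gradient descent.
# The "images" are random vectors standing in for real pixel patches.
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_hidden, n_samples = 64, 16, 500   # e.g. 8x8 patches -> 16 features
X = rng.random((n_samples, n_inputs))         # stand-in for real image patches

W1 = rng.normal(0, 0.1, (n_inputs, n_hidden)) # encoder weights
W2 = rng.normal(0, 0.1, (n_hidden, n_inputs)) # decoder weights
lr = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(200):
    H = sigmoid(X @ W1)            # hidden representation (the "learned features")
    X_hat = H @ W2                 # reconstruction of the input
    err = X_hat - X                # reconstruction error

    # Backpropagate the squared reconstruction error.
    grad_W2 = H.T @ err / n_samples
    grad_H = err @ W2.T * H * (1 - H)
    grad_W1 = X.T @ grad_H / n_samples

    W1 -= lr * grad_W1
    W2 -= lr * grad_W2

print("final reconstruction MSE:", float(np.mean(err ** 2)))
```

When a model like this is trained on real image patches rather than random data, visualizing the columns of W1 is where the edge-detector-like features tend to show up.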
Current research: A joint Stanford-Google team recently published a paper at ICML 2012 where an unsupervised feature learning system was trained on a large corpus of YouTube videos using some of Google’s considerable computational resources. The system learned to recognize human and (more amusingly) cat faces, a result which garnered a good deal of press coverage. One technical challenge involved scaling up their approach to a large-scale distributed setting, which was accomplished using asynchronous updates against a central “parameter server”.
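Since the asynchronous “parameter server” pattern is the recurring theme here, a minimal sketch of the idea may help. This is a single-process analogy using threads and toy least-squares SGD, with a made-up ParameterServer class of my own; it is not the actual system from the paper:

```python
# Toy sketch of the "asynchronous updates against a parameter server" pattern,
# applied to least-squares SGD. A single-process analogy using threads.
import threading
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0, 0.5])
X = rng.normal(size=(10_000, 3))
y = X @ true_w + 0.01 * rng.normal(size=10_000)

class ParameterServer:
    """Holds the global parameters; workers push gradient updates asynchronously."""
    def __init__(self, dim, lr=0.01):
        self.w = np.zeros(dim)
        self.lr = lr
        self.lock = threading.Lock()

    def pull(self):
        with self.lock:
            return self.w.copy()

    def push(self, grad):
        with self.lock:
            self.w -= self.lr * grad

def worker(server, X_shard, y_shard, steps=2000, batch=32):
    rng_local = np.random.default_rng()
    for _ in range(steps):
        idx = rng_local.integers(0, len(X_shard), size=batch)
        w = server.pull()                      # possibly stale parameters
        err = X_shard[idx] @ w - y_shard[idx]
        grad = X_shard[idx].T @ err / batch
        server.push(grad)                      # asynchronous update

server = ParameterServer(dim=3)
shards = np.array_split(np.arange(len(X)), 4)  # 4 "machines"
threads = [threading.Thread(target=worker, args=(server, X[s], y[s])) for s in shards]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("learned parameters:", server.pull())    # should be close to true_w
```

The key design choice is that workers compute gradients against possibly stale parameters and push updates without waiting for one another; in practice that staleness turns out to be a tolerable price for removing the synchronization bottleneck.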
Other resources: The Stanford deep learning page has links to papers and an interesting deep learning tutorial. Prof. Ng had also given a talk on this subject at an ICML 2011 workshop, and the slides provide great background and context for the ICML 2012 paper.
Finally, if you are at all interested in this area of research, I would strongly recommend enrolling in the (free!) online Coursera class Neural Networks for Machine Learning, taught by Geoffrey Hinton.
Big idea: Probabilistic modeling provides a useful framework for developing rich models of user interests and behavior. Here “rich” means models that attempt to capture more realistic details, or equivalently, to make fewer simplifying assumptions. For example, we can explicitly model the fact that a user’s interests are not static, but rather evolve over time. Given enough data, these higher-fidelity models can yield significant gains in prediction and recommendation performance.
Current research: These models often include many latent variables representing unobserved quantities, such as the underlying “topics” of a given news article or the mix of “interests” associated with a particular user. This usually leaves us with two related challenges when doing inference: exact inference over the latent variables is typically intractable, so we have to settle for approximate methods (e.g., sampling or variational inference), and those approximations then have to scale to realistic amounts of data.
Beyond these algorithmic challenges, there are very interesting design questions about which aspects of reality to capture in the model, and which to ignore.
Other resources: In the talk, Prof. Smola alluded to a scalable distributed implementation of the Latent Dirichlet Allocation (LDA) topic model that was developed while he was at Yahoo. Interestingly, the organization of this system was similar to the one later used by Prof. Ng and colleagues: asynchronous updates against a global “parameter server”.
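For readers who have not met LDA before, here is a tiny single-machine illustration using scikit-learn. It is only meant to show what the model produces; it has nothing to do with the distributed Yahoo implementation, and the toy corpus is my own:

```python
# Tiny single-machine LDA example with scikit-learn on a toy corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the goalkeeper made a great save in the football match",
    "the striker scored twice and the home team won the match",
    "the central bank raised interest rates again this quarter",
    "markets fell after the bank reported lower quarterly profits",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)              # bag-of-words counts
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

vocab = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top_words = [vocab[i] for i in topic.argsort()[-5:][::-1]]
    print(f"topic {k}:", top_words)
```

Each document is then modeled as a mixture over these inferred topics, which is exactly the kind of latent structure the talk was concerned with inferring at scale.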
As it turns out, Coursera is also offering a free online course in Probabilistic Graphical Models, taught by Daphne Koller.
Guaranteed Sparse Optimization on the Probability Simplex via Convex Programming (Laurent El Ghaoui, UC Berkeley; Mert Pilanci, UC Berkeley; Venkat Chandrasekaran, Caltech) - This paper addresses the problem of minimizing a convex function over the probability simplex with a penalty on the cardinality (i.e., the L0 norm) of the solution. While penalizing the L1 norm is the usual way to encourage sparsity, that doesn’t work here because the L1 norm equals 1 for every feasible solution (by the definition of the simplex); a quick numerical check of this point appears after these paper notes.
Fixed-Form Variational Posterior Approximation Through Stochastic Linear Regression (Tim Salimans, Erasmus U. Rotterdam; David Knowles, Stanford University) - This paper presents an interesting general technique for variational inference. It is also available as a (much) longer arXiv version.
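Here is the quick numerical check promised above, just my own illustration of why an L1 penalty carries no information on the simplex:

```python
# On the probability simplex every feasible x has non-negative entries summing
# to one, so its L1 norm is exactly 1 and an L1 penalty cannot distinguish
# sparse from dense solutions.
import numpy as np

rng = np.random.default_rng(0)
for _ in range(3):
    x = rng.random(10)
    x /= x.sum()                 # a random point on the simplex
    print(np.abs(x).sum())       # always 1.0 (up to floating point)

sparse = np.zeros(10)
sparse[0] = 1.0                  # the sparsest possible simplex point
print(np.abs(sparse).sum())      # also 1.0: the penalty can't tell them apart
```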