12 August 2021

From basic research to machine learning libraries


This is a story of foundational research and innovative algorithms, motivated by basic algorithmic questions, unfolding into real-world impact: the TensorSketch algorithm, from theory to practice.

Machine learning algorithms often work with large sets of features that are used to describe a digital object such as an image, a piece of text, or the database records of a person. The set of features can be used for tasks such as prediction or classification. Often, the set of possible features is very large, with only a small fraction of features appearing in each object, which already poses a challenge to the scalability of algorithms. To make things worse, it is often essential to consider combinations of features to get powerful learning algorithms. For illustration, the individual presence of the words “basic” and “research” in a document does not say much about the topic of the document, but their joint presence in a single sentence suggests that the document discusses basic research. To model such combinations, we may conceptually imagine a much larger space of features, consisting of all possible pairs of features. Or even triples! But how can we effectively deal with such large feature spaces?
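
To make the blow-up concrete, here is a small illustrative sketch in NumPy (not taken from the original papers); the dimensions are made up purely for illustration.

    import numpy as np

    d = 1_000                        # number of original features
    x = np.zeros(d)
    x[[3, 17, 42]] = 1.0             # a sparse object with three active features

    # The explicit degree-2 feature map lists every pair of features,
    # i.e. the flattened outer product of x with itself.
    pairwise = np.outer(x, x).ravel()
    print(pairwise.shape)            # (1000000,) -- d^2 pair features already at d = 1000

With d in the hundreds of thousands, the d^2 pair features can no longer even be written down, which is exactly the scalability problem described above.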

Fortunately, powerful dimensionality reduction algorithms have been developed that are able to map very high-dimensional feature vectors into vectors of, say, a few hundred dimensions, while retaining the essential information. This means that machine learning algorithms do not have to deal with huge feature vectors, but may instead work with such reduced, or “compressed”, representations. However, a new bottleneck arises: how to efficiently map a set of features to a low-dimensional representation, especially if this representation should reflect pairwise (or k-wise) feature interactions.
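
As an illustration of the kind of dimension reduction involved, here is a minimal Count Sketch (the technique that enters the story below); the dimensions and random seed are arbitrary choices for the example.

    import numpy as np

    rng = np.random.default_rng(0)
    d, m = 100_000, 512                  # original and reduced dimension
    h = rng.integers(0, m, size=d)       # random bucket for each feature
    s = rng.choice([-1.0, 1.0], size=d)  # random sign for each feature

    def count_sketch(x):
        """Map a length-d feature vector to a length-m sketch."""
        sketch = np.zeros(m)
        np.add.at(sketch, h, s * x)      # sketch[h[i]] += s[i] * x[i]
        return sketch

    x = rng.standard_normal(d)
    # Norms (and inner products) are approximately preserved by the sketch.
    print(x @ x, count_sketch(x) @ count_sketch(x))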

In 2011 BARC core member Rasmus Pagh was thinking about how to efficiently compute compressed representations of matrix products, motivated by applications in linear algebra. He realised that the dimension reduction technique Count Sketch could be combined with the Fast Fourier Transform to obtain an efficient algorithm that computes the compressed representation directly, instead of first computing the matrix product and then performing dimension reduction on the result. The results were published in ITCS, a conference dedicated to conceptual innovation in theoretical computer science.
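
The key observation can be checked in a few lines of NumPy: a Count Sketch of the outer product of two vectors (one column-times-row contribution to a matrix product) equals the circular convolution of the two vectors' individual Count Sketches, and circular convolutions can be computed quickly with the FFT. The code below is an illustrative verification of this identity, not the authors' implementation.

    import numpy as np

    rng = np.random.default_rng(1)
    d, m = 300, 64

    # Two independent Count Sketches (random buckets and signs).
    h1, h2 = rng.integers(0, m, size=d), rng.integers(0, m, size=d)
    s1, s2 = rng.choice([-1.0, 1.0], size=d), rng.choice([-1.0, 1.0], size=d)

    def cs(v, h, s):
        out = np.zeros(m)
        np.add.at(out, h, s * v)
        return out

    x, y = rng.standard_normal(d), rng.standard_normal(d)

    # Slow route: sketch all d*d entries of the outer product directly, using
    # the combined bucket (h1[i] + h2[j]) mod m and sign s1[i] * s2[j].
    direct = np.zeros(m)
    for i in range(d):
        for j in range(d):
            direct[(h1[i] + h2[j]) % m] += s1[i] * s2[j] * x[i] * y[j]

    # Fast route: circular convolution of the two small sketches via FFT,
    # costing O(d + m log m) instead of O(d^2).
    fast = np.fft.irfft(np.fft.rfft(cs(x, h1, s1)) * np.fft.rfft(cs(y, h2, s2)), n=m)

    print(np.allclose(direct, fast))     # True: both routes give the same sketch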

In 2012 PhD student Ninh Pham realised that the algorithmic technique behind compressed matrix multiplication could be used to remove the computational bottleneck that had made it difficult to perform dimension reduction on feature pairs, triples, and so on, the feature spaces underlying so-called polynomial kernels. In collaboration with Rasmus, Ninh developed and analysed the TensorSketch algorithm, verified its performance empirically, and published the results at the KDD 2013 conference.
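
In the degree-2 case, the resulting feature map has inner products that approximate the polynomial kernel (x·y)^2. The following self-contained sketch illustrates the idea; the dimensions and parameters are chosen arbitrarily for the example.

    import numpy as np

    rng = np.random.default_rng(2)
    d, m = 1_000, 4_096

    # Two independent Count Sketches, shared by all input vectors.
    h1, h2 = rng.integers(0, m, size=d), rng.integers(0, m, size=d)
    s1, s2 = rng.choice([-1.0, 1.0], size=d), rng.choice([-1.0, 1.0], size=d)

    def cs(v, h, s):
        out = np.zeros(m)
        np.add.at(out, h, s * v)
        return out

    def tensor_sketch(v):
        """Degree-2 TensorSketch: sketch both copies of v and convolve via FFT."""
        return np.fft.irfft(np.fft.rfft(cs(v, h1, s1)) * np.fft.rfft(cs(v, h2, s2)), n=m)

    x = rng.standard_normal(d)
    y = x + 0.1 * rng.standard_normal(d)
    # The sketched inner product approximates the degree-2 polynomial kernel;
    # the approximation improves as m grows.
    print((x @ y) ** 2, tensor_sketch(x) @ tensor_sketch(y))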

In recent years, feature mappings like the one we developed in 2011-2013 have become increasingly important. In late 2020, the TensorSketch algorithm was included in scikit-learn, one of the world’s most popular machine learning libraries. This shows how foundational research and innovative algorithms, motivated by basic algorithmic questions, can have real-world impact.
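
A minimal usage sketch, assuming scikit-learn 0.24 or later, where the method is exposed as PolynomialCountSketch in sklearn.kernel_approximation; the dataset and model choices below are just for illustration.

    from sklearn.datasets import load_digits
    from sklearn.kernel_approximation import PolynomialCountSketch
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import make_pipeline

    X, y = load_digits(return_X_y=True)
    X = X / 16.0                         # scale pixel values to [0, 1]

    # Approximate a degree-2 polynomial kernel with a few hundred sketched
    # features, then train an ordinary linear classifier on top of them.
    model = make_pipeline(
        PolynomialCountSketch(degree=2, n_components=300, random_state=0),
        SGDClassifier(max_iter=1000, tol=1e-3, random_state=0),
    )
    model.fit(X, y)
    print(model.score(X, y))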