BARC Talk by Sandeep Silwal


We consider the problem of hypothesis selection: we are given knowledge of k discrete distributions $\{v_i\}_{i=1}^{\infty}$ over the domain [n] = {1,...,n} which we can preprocess. Then we get samples from an unknown discrete distribution p, also over [n]. The goal is to output the closest distribution to p among the v_i's in TV distance (up to some small additive error). The state of the art algorithms require \Theta(\log k) samples and run in near linear time.

We introduce a fresh perspective on the problem and ask if we can output the closest distribution in sublinear time. This question is particularly motivated as the problem is a generalization of the traditional nearest neighbor search problem: if we take enough samples, we can learn p explicitly up to low TV distance and then find the closest v_i in o(k) time using standard neast neighbor search algorithms.

However, this approach requires Omega(n) samples. Thus, it is natural to ask: can we obtain both sublinear number of samples and sublinear query time? We present some nice progress towards this question.

This is joint work with Anders Aamand, Alex Andoni, Justin Chen, Piotr Indyk, and Shyam Narayanan.