BARC Talk by Giorgio Vinciguerra
Learned Monotone Minimal Perfect Hashing
Abstract:
A Monotone Minimal Perfect Hash Function (MMPHF) is a hash function that maps keys from a given set S to their rank. On keys not in S, the MMPHF returns an arbitrary value. It is both perfect because it has no collisions on S, and minimal because its output range has size |S|.
Existing MMPHFs mostly rely on dividing the keys into buckets implemented as retrieval data structures and using a trie-like distributor, and they have found applications in databases, search engines, data encryption, and pattern-matching algorithms.
In this talk, we offer a fresh perspective on MMPHFs by introducing LeMonHash, an MMPHF that learns and leverages the smoothness of the input data.
The core idea of LeMonHash is surprisingly simple and effective: we learn a monotone mapping from keys to their rank via an error-bounded piecewise linear model, and then we solve the collisions that might arise among keys mapping to the same rank-estimate by associating them with small integers in a retrieval data structure.
On the theoretical side, LeMonHash achieves O(1) bits per key for sufficiently smooth inputs, breaking the superlinear lower bound for the general problem. On the practical side, on most datasets, LeMonHash dominates all competitors on space usage, construction and query time, simultaneously.