BARC/EADS Talk by Yufei Tao

On 1 June 2018, Professor Yufei Tao from the Department of Computer Science and Engineering at the Chinese University of Hong Kong will be Rasmus Pagh’s BARC guest and give a talk at BARC.

Title

Entity Matching with Active Monotone Classification 

Abstract

Given two sets of entities X and Y, entity matching aims to decide whether x and y represent the same entity for each x in R and y in Y. As the last resort, human experts can be called upon to inspect every (x, y), but this is expensive because the correct verdict could not be determined without investigation efforts dedicated specifically to the two entities x and y involved. It is therefore important to design an algorithm that asks humans to look at only some pairs, and renders the verdicts on the other pairs automatically with good accuracy. 

At the core of most (if not all) existing approaches is the following classification problem. The input is a set P of points in d-dimensional space, each of which carries a binary label: 0 or 1. A classifier F is a function that maps d-dimensional points to {0, 1}. The objective is to find a classifier that captures the labels of a large number of points in P.

In this paper, we cast the problem as an instance of active learning where the goal is to learn a monotone classifier F, namely, F(p) >= F(q) holds whenever the coordinate of p is at least that of q on all dimensions. In our formulation, the labels of all points in P are hidden at the beginning. An algorithm A can invoke an oracle, which discloses the label of a point p in P chosen by A. The algorithm may do so repetitively, until it has garnered enough information to produce F. The cost of A is the number of times that the oracle is called. The challenge is to strike a good balance between the cost and the accuracy of the classifier produced.

We describe algorithms with non-trivial guarantees on the cost and accuracy simultaneously. We also prove lower bounds that establish the asymptotic optimality of our solutions for a wide range of parameters.

Bio

Yufei Tao is a Professor in the Department of Computer Science and Engineering, Chinese University of Hong Kong. He served as an associate editor of ACM Transactions on Database Systems (TODS) from 2008 to 2015, and of IEEE Transactions on Knowledge and Data Engineering (TKDE) from 2012 to 2014. He served as a PC co-chair of International Conference on Data Engineering (ICDE) 2014. He gave a keynote speech at International Conference on Database Theory (ICDT) 2016. He received the best-paper awards at SIGMOD in 2013 and  2015, respectively, and a Google Faculty Research Award in 2017. He is an ACM distinguished scientist. His current research aims to develop "small-and-sweet" algorithms: (i) small: easy to implement for deployment in practice, and (ii) sweet: having non-trivial theoretical guarantees.