Large-data deduplication problem (a joint session with the Computer Science Colloquium).
Imagine that you have a long list of items, say a hundred thousands of items. For example, the items may be client addresses. Some of the addresses are essentially duplicates distinguished only by “St.” vs. “Street”, or “Bill” vs. “William”, or by little spelling errors, etc. You don’t want to miss any of your clients, and you don’t want to annoy them by sending them multiple copies of your communications. How do you clean up your item list? The problem is ubiquitous and hard. We analyze the problem and describe a fast probabilistic algorithm for it.
Yuri Gurevich is a Principal Researcher at Microsoft and Prof. Emeritus at the University of Michigan. He is also an ACM Fellow, a Guggenheim Fellow, a member of Academia Europaea, and Dr. Honoris Causa of Hasselt University in Belgium and of Ural State University in Russia.