Computational Logic Seminar
Thursday, December 5, 2013, 4:15 pm
Graduate Center, rm. 9204/9205

Yuri Gurevich

Large-data deduplication problem (a joint session with the Computer Science Colloquium).

Microsoft Research

Imagine that you have a long list of items, say a hundred thousand of them. For example, the items may be client addresses. Some of the addresses are essentially duplicates, distinguished only by “St.” vs. “Street”, or “Bill” vs. “William”, or by small spelling errors. You don’t want to miss any of your clients, and you don’t want to annoy them by sending them multiple copies of your communications. How do you clean up your item list? The problem is ubiquitous and hard. We analyze the problem and describe a fast probabilistic algorithm for it.