Blog Archives
Topic Archive: big data
Computational Logic Seminar
Thursday, December 5, 2013, 4:15 pm
Graduate Center, rm. 9204/9205
Large-data deduplication problem (a joint session with the Computer Science Colloquium).
Microsoft Research
Imagine that you have a long list of items, say a hundred thousand of them. For example, the items may be client addresses. Some of the addresses are essentially duplicates, distinguished only by “St.” vs. “Street”, or “Bill” vs. “William”, or by small spelling errors. You don’t want to miss any of your clients, and you don’t want to annoy them by sending multiple copies of your communications. How do you clean up your item list? The problem is ubiquitous and hard. We analyze the problem and describe a fast probabilistic algorithm for it.
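To make the problem concrete, here is a minimal Python sketch of the simplest baseline: rewrite each item into a canonical form and group the items whose canonical forms coincide. This is not the probabilistic algorithm of the talk, which the abstract does not describe. The abbreviation table `CANONICAL` and the helper names `normalize` and `group_duplicates` are hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict

# Hypothetical abbreviation table (an assumption for illustration);
# a realistic table would be far larger.
CANONICAL = {"st": "street", "ave": "avenue", "bill": "william"}


def normalize(item: str) -> str:
    """Lowercase, drop punctuation, and expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", item.lower())
    return " ".join(CANONICAL.get(t, t) for t in tokens)


def group_duplicates(items):
    """Group items whose normalized forms are identical."""
    groups = defaultdict(list)
    for item in items:
        groups[normalize(item)].append(item)
    return list(groups.values())


addresses = [
    "Bill Smith, 12 Main St.",
    "William Smith, 12 Main Street",
    "Ann Lee, 4 Oak Ave",
]
clusters = group_duplicates(addresses)
# The first two addresses land in one cluster; the third stands alone.
```

Exact matching on a normalized key runs in a single pass over the list, but it catches only the variations the table anticipates. A spelling error such as “Smyth” for “Smith” yields a different key and goes undetected, which is part of what makes the general problem hard.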