Topic Archive: big data
Imagine that you have a long list of items, say a hundred thousands of items. For example, the items may be client addresses. Some of the addresses are essentially duplicates distinguished only by “St.” vs. “Street”, or “Bill” vs. “William”, or by little spelling errors, etc. You don’t want to miss any of your clients, and you don’t want to annoy them by sending them multiple copies of your communications. How do you clean up your item list? The problem is ubiquitous and hard. We analyze the problem and describe a fast probabilistic algorithm for it.