The suppliers table that I'm working with contains 41 columns, most of
which are no use whatever to finding duplicates. For example, the time
a record was last modified doesn't help, nor does the primary key. And
so on. As far as I can see, only six columns are of any help here. So
apart from the framework (and some data-cleaning functions), once
I've imported the framework, this is all the Python code I need to check
for duplicates:
THRESHOLD = 0.85
file = "suppliers-merged.csv"
So, we set the probability threshold, give the file name, and define how
we want to treat our six identity-discriminating properties. Let's walk
through them one by one.
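The six property definitions could be expressed as something like the following. Only the NAME, organization-number, and ZIP_CODE probabilities come from the discussion below; the ADDRESS1/ADDRESS2 values, the column names, and the sixth column (COUNTRY is a guess) are illustrative assumptions, not the actual configuration:

```python
# Sketch of the six identity-discriminating properties. Each entry is
# (probability the records match if the values are equal,
#  probability they match if the values differ);
# a missing value on either side counts as a neutral 0.5.
properties = {
    "NAME":     (0.9, 0.1),
    "ASSOC_NO": (0.9, 0.1),   # organization number
    "ZIP_CODE": (0.6, 0.4),
    "ADDRESS1": (0.75, 0.45), # assumed values
    "ADDRESS2": (0.75, 0.45), # assumed values
    "COUNTRY":  (0.55, 0.45), # assumed sixth column
}
```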
So here you see immediately a benefit of the approach: let's say two
organizations have the same name, but different organization
numbers. That gives us 0.9 and 0.1 probability, which combines to 0.5.
Having weighed the evidence, Bayes steps in and saves us from making
what's probably a mistake. However, if two organizations have the
same name and no organization number, we get 0.9 and 0.5, which
combines to 0.9.
Anyway, ZIP_CODE is the next one. The zip code provides only a minor
hint: if the two suppliers have different codes, the likelihood that they
are the same is slightly lower (0.4), and if they have the same code it is
slightly higher (0.6). Let's say two records have the same name, no
organization number, and different zip codes. This gives us 0.9, 0.5,
and 0.4. This combines to 0.857, which is just above the threshold. If
we find more counter-indicating evidence this pair is going to drop
below the threshold.
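The combination rule behind these numbers is naive-Bayes style: multiply the probabilities together, multiply their complements together, and normalize. A minimal sketch (the framework's actual function may look different):

```python
def combine(probabilities):
    """Combine independent probability estimates naive-Bayes style."""
    p = 1.0
    not_p = 1.0
    for prob in probabilities:
        p *= prob
        not_p *= 1.0 - prob
    return p / (p + not_p)

# Reproducing the examples from the text:
combine([0.9, 0.1])       # same name, different org number -> 0.5
combine([0.9, 0.5])       # same name, org number missing   -> 0.9
combine([0.9, 0.5, 0.4])  # ...plus different zip codes     -> ~0.857
```

Note that a neutral 0.5 leaves the combined result unchanged, which is exactly why a missing organization number neither helps nor hurts.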
ADDRESS1 and ADDRESS2 are much like the zip code, except differences here
(often due simply to spelling or different abbreviations for "street" or
"mailbox") don't signify much. However, if they are the same that's a
fairly strong indicator.
Example data
Let's look at how this works on real data. I'll pass over the cases where
the organization number is the same, and look at ones where the other
attributes do the job. The data have been slightly modified so that I'm
not exposing customer data on the web.
This combines to a 0.93 probability, which is well above the threshold, as
indeed it should be. You'll note that my cleaning of the address is not
sophisticated enough, so it doesn't catch that ADDRESS1 is in fact the
same. This is easy to fix.
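Fixing it is mostly a matter of normalizing the addresses before comparing them. The actual cleaning functions aren't shown in this post, but a sketch in that spirit (the abbreviation table here is invented) could be:

```python
# Hypothetical address normalizer; the real cleaning code and its
# abbreviation table are not part of this post.
ABBREVIATIONS = {
    "street": "st",   # collapse alternative spellings of the same word
    "mailbox": "mb",
    "box": "mb",
}

def clean_address(address):
    # Lowercase, drop punctuation, and map known abbreviations so that
    # "Acme Street 4" and "acme st. 4" compare equal.
    tokens = address.lower().replace(".", " ").replace(",", " ").split()
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)
```

With this, clean_address("Acme Street 4") and clean_address("acme st. 4") both come out as "acme st 4", so the ADDRESS1 comparison would count as a match.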
The beauty of this approach is that it works astoundingly well for the
amount of effort involved. After just two hours I find huge numbers of
duplicates with low false positive rates. The main downside is that the
pairwise comparison is slow for a table of 17,000 rows, but even this
can be solved. And perhaps the best part is that once the basic (trivial)
framework is up, doing another table takes just a few minutes.
Note that I'm not necessarily recommending that you do this yourself.
There are real tools out there, both open source and commercial, which
I'm sure do a vastly better job than a Python script cooked up in two
hours, but I thought it was useful for illustrating what the concept is, and
how much you can get done with simple methods.
More advanced stuff
Of course, simply comparing the names as strings is not very
sophisticated. What if one company is called "acme inc" and another
simply "acme"? This is what the name_compare function briefly referenced
in the example takes care of. Essentially, it breaks up the names into
tokens, then compares the sets of tokens. At the moment it does so
fairly primitively, but it's quite effective. Unfortunately, it also produces
some false positives; my next task is to figure out how to improve that.
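To illustrate the idea (this is not the actual name_compare code, and the containment measure is my own simplification):

```python
def name_compare(name1, name2, high=0.9, low=0.1):
    # Tokenize, then score by how much of the shorter name's token set
    # is contained in the other's, so "acme inc" vs "acme" scores high.
    tokens1 = set(name1.lower().split())
    tokens2 = set(name2.lower().split())
    if not tokens1 or not tokens2:
        return 0.5  # no evidence either way
    containment = len(tokens1 & tokens2) / min(len(tokens1), len(tokens2))
    return low + (high - low) * containment
```

This sketch also shows where the false positives come from: any record whose shorter name is wholly contained in the other scores the full 0.9, so "acme" matches "acme demolition inc" just as strongly as it matches "acme inc".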
This, by the way, is one area where the ready-made products shine.
They understand misspellings, different transcriptions of the same
names, matching of complex strings, and so on. So once you need these
things (or performance) you may be better off with a ready-made
product.