The code

The suppliers table that I'm working with contains 41 columns, most of
which are no use whatsoever in finding duplicates. For example, the time
a record was last modified doesn't help, nor does the primary key. And
so on. As far as I can see, only six columns are of any help here. So,
apart from the framework itself (and some data-cleaning functions), once
I've imported it, this is all the Python code I need to check for
duplicates:

THRESHOLD = 0.85
file = "suppliers-merged.csv"

relevant = [Property("NAME", 0.5, 0.9, name_clean, name_compare),
            Property("ASSOCIATION_NO", 0.1, 0.88, orgnr_clean),
            Property("ZIP_CODE", 0.4, 0.6, basic_clean),
            Property("COUNTRY", 0.1, 0.51, basic_clean),
            Property("ADDRESS1", 0.49, 0.75, basic_clean),
            Property("ADDRESS2", 0.49, 0.75, basic_clean)]

self_compare(file, THRESHOLD, relevant)

So, we set the probability threshold, give the file name, and define how
we want to treat our six identity-discriminating properties. Let's walk
through them one by one.

NAME is the name of the organization. Here we provide a function to
normalize the names (name_clean), which strips out extra whitespace,
changes "Ltd." into "Limited", lowercases the name, and so on. The
first number says that if the names of two records are different, the
probability that the organizations are the same is 0.5 (that is, 50/50,
meaning we've learned nothing). The second says that if the names are
the same, the probability that the organizations are the same is 90%.
This is above our threshold of 0.85, so unless we later find evidence
indicating that the organizations are different, we will consider them
duplicates.
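
The post doesn't show name_clean itself. A minimal sketch of the kind
of normalization described above (collapsing whitespace, expanding
abbreviations, lowercasing) might look like this; the abbreviation
table here is my own illustration, not the author's:

import re

# Hypothetical abbreviation table; the real function presumably knows
# many more expansions than these.
ABBREVIATIONS = {"ltd.": "limited", "ltd": "limited", "co.": "company"}

def name_clean(name):
    # Lowercase and collapse runs of whitespace into single spaces.
    name = re.sub(r"\s+", " ", name.lower()).strip()
    # Expand known abbreviations token by token.
    return " ".join(ABBREVIATIONS.get(token, token)
                    for token in name.split(" "))

print(name_clean("  ACME   Ltd. "))  # -> "acme limited"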
ASSOCIATION_NO is the unique organization number assigned to the
organization by the government. You'd think we could just match
records on this column directly in SQL and go home, but it's not that
simple. The numbers are formatted differently, not all suppliers have
one, and in some cases different parts of a large organization (like
the City of Oslo) share a single number. We provide a function to
clean up the data, then specify our probabilities: if both numbers are
provided and they are different, there's only a 10% chance the
organizations are the same. However, if we get a match there's an 88%
chance that they are the same.
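
orgnr_clean isn't shown either. Since Norwegian organization numbers
are nine digits, usually written with grouping spaces and sometimes
with "NO"/"MVA" decoration, a plausible sketch is simply to keep the
digits and discard anything that doesn't look like a valid number:

def orgnr_clean(orgnr):
    # Keep only the digits, dropping spaces, dots and decoration.
    digits = "".join(c for c in orgnr if c.isdigit())
    # Treat anything that isn't nine digits as a missing value.
    return digits if len(digits) == 9 else ""

print(orgnr_clean("NO 177 477 707 MVA"))  # -> "177477707"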

So here you see immediately a benefit of the approach: let's say two
organizations have the same name, but different organization numbers.
That gives us probabilities of 0.9 and 0.1, which combine to 0.5.
Having weighed the evidence, Bayes steps in and saves us from making
what's probably a mistake. However, if two organizations have the same
name and no organization number, we get 0.9 and 0.5, which combine
to 0.9.
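
The post doesn't show the combination step itself, but the numbers
quoted above and below all match the standard naive Bayes formula for
combining independent pieces of evidence, so a sketch of it would look
something like this (combine is my name, not necessarily the
framework's):

def combine(probs):
    # Naive Bayes combination of per-property probabilities, assuming
    # the properties are independent:
    #   p = prod(p_i) / (prod(p_i) + prod(1 - p_i))
    p_same, p_diff = 1.0, 1.0
    for p in probs:
        p_same *= p
        p_diff *= 1.0 - p
    return p_same / (p_same + p_diff)

print(combine([0.9, 0.1]))       # same name, different org number -> 0.5
print(combine([0.9, 0.5]))       # same name, org number missing   -> 0.9
print(combine([0.9, 0.5, 0.4]))  # ... plus different zip codes    -> 0.857

Note that a probability of exactly 0.5 multiplies both products
equally and so drops out: missing evidence changes nothing, which is
exactly the behavior described above.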

Anyway, ZIP_CODE is the next one. The zip code provides only a minor
hint: if the two suppliers have different codes, the likelihood that
they are the same is slightly lower (0.4), and if they have the same
code it is slightly higher (0.6). Let's say two records have the same
name, no organization number, and different zip codes. This gives us
0.9, 0.5, and 0.4, which combine to 0.857, just above the threshold.
If we find more counter-indicating evidence, this pair is going to
drop below the threshold.

COUNTRY is much the same. If the organizations are in different
countries, chances are very slim that they're the same one (although
it could be a typing error), whereas if they are in the same country
that barely signifies.

ADDRESS1 and ADDRESS2 are much like the zip code, except differences here
(often due simply to spelling or different abbreviations for "street" or
"mailbox") don't signify much. However, if they are the same, that's a
fairly strong indicator.

[Photo: Nobel Peace Center, Oslo]

Example data
Let's look at how this works on real data. I'll pass over the cases where
the organization number is the same, and look at ones where the other
attributes do the job. The data have been slightly modified so that I'm
not exposing customer data on the web.

Property   Record 1    Record 2     Probability
Name       acme inc    acme inc     0.9
Assoc. no  177477707                0.5
Zip code   9161        9161         0.6
Country    norway      norway       0.51
Address 1  mb 113      mailbox 113  0.49
Address 2                           0.5

This combines to a probability of 0.93, which is well above the
threshold, as indeed it should be. You'll note that my cleaning of the
address is not sophisticated enough, so it doesn't catch that ADDRESS1
is in fact the same. This is easy to fix.
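
Feeding the per-property probabilities from the table into the combine
sketch from earlier reproduces the final score:

print(combine([0.9, 0.5, 0.6, 0.51, 0.49, 0.5]))  # -> 0.9311, i.e. 0.93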

The beauty of this approach is that it works astoundingly well for the
amount of effort involved. After just two hours I find huge numbers of
duplicates with low false positive rates. The main downside is that the
pairwise comparison is slow for a table of 17,000 rows, but even this
can be solved. And perhaps the best part is that once the basic (trivial)
framework is up, doing another table takes just a few minutes.

Note that I'm not necessarily recommending that you do this yourself.
There are real tools out there, both open source and commercial, which
I'm sure do a vastly better job than a Python script cooked up in two
hours, but I thought this was useful for illustrating what the concept
is, and how much you can get done with simple methods.

More advanced stuff
Of course, simply comparing the names as strings is not very
sophisticated. What if one company is called "acme inc" and another
simply "acme"? This is what the name_compare function briefly referenced
in the example takes care of. Essentially, it breaks up the names into
tokens, then compares the sets of tokens. At the moment it does so
fairly primitively, but it's quite effective. Unfortunately, it also
produces some false positives. My next task is to figure out how to
improve that.
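
The post only describes name_compare in outline, and how its return
value interacts with the 0.5-0.9 bounds is my guess; but a primitive
token-set comparison along the lines described could be as simple as
Jaccard similarity:

def name_compare(name1, name2):
    # Break the cleaned names into token sets and measure their overlap.
    tokens1, tokens2 = set(name1.split()), set(name2.split())
    if not tokens1 or not tokens2:
        return 0.0
    return len(tokens1 & tokens2) / len(tokens1 | tokens2)

print(name_compare("acme inc", "acme"))  # -> 0.5; exact comparison gives 0

One likely source of the false positives mentioned above is that common
tokens like "inc" match across unrelated names, so some weighting by
token rarity would probably help.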

This, by the way, is one area where the ready-made products shine.
They understand misspellings, different transcriptions of the same
names, matching of complex strings, and so on. So once you need these
things (or performance) you may be better off with a ready-made
product.
