
the attached code first, and then implement the remaining four functions:

preprocessDF(), filtering(), verification(), and evaluate().

The Code:

entity_resolution.py

The Data set:

amazon-google-sample.zip

Output:
The program should output the following when running on the provided data:
Before filtering: 256 pairs in total
After Filtering: 79 pairs left
After Verification: 5 similar pairs
(precision, recall, fmeasure) = (1.0, 0.3125, 0.47619047619047616)

Jaccard Similarity Join

Task A. Data Preprocessing (Record --> Token Set)

Since Jaccard needs to take two sets as input, your first job is to preprocess DataFrames by
transforming each record into a set of tokens. Please implement the following function.

Hints.

If you have mastered the use of UDF and withColumn by doing Assignment 3, you
should have no problem finishing this task. One small hint: take a look at
concat_ws.

For testing purposes, you can compare your outputs with newDF1 and
newDF2, which can be found in the test folder of the Amazon-Google-Sample dataset.
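
As an illustration of the record-to-token-set step, here is a minimal plain-Python sketch. The tokenization rule (lowercase, split on non-alphanumeric characters, drop empty tokens), the helper name tokenize, and the sample record are assumptions made for this example, not taken from the assignment; in Spark the same logic would typically sit inside a UDF applied with withColumn after joining the columns with concat_ws.

```python
import re

def tokenize(record, columns):
    """Turn the chosen columns of one record into a set of tokens.

    Assumed rule (not specified by the assignment): join the column values
    with a space, lowercase, split on non-alphanumeric characters.
    """
    # Skip missing values, similar to how concat_ws ignores nulls.
    joined = " ".join(str(record[c]) for c in columns if record.get(c) is not None)
    tokens = re.split(r"\W+", joined.lower())
    # re.split can yield empty strings at the edges; drop them.
    return {t for t in tokens if t}

# Hypothetical record, for illustration only.
record = {"id": "a1", "title": "Adobe Photoshop CS3", "manufacturer": None}
join_key = tokenize(record, ["title", "manufacturer"])
```

Dropping empty tokens matters: without that filter, a value with leading or trailing punctuation would add an empty string to every token set and inflate similarities later on.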

Task B. Filtering Obviously Non-matching Pairs

Hints.

You need to construct an inverted index for df1 and df2, respectively. The inverted index is a
DataFrame with two columns: token and id, which stores a mapping from each token to a record
that contains the token. You might need to use flatMap to obtain the inverted index.

For testing purposes, you can compare your output with candDF, which can be found in
the test folder of the Amazon-Google-Sample dataset.
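
The inverted-index idea can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function names inverted_index and candidate_pairs and the sample records are hypothetical, and the dictionaries stand in for the token/id DataFrames described above. Two records become a candidate pair when they share at least one token.

```python
from collections import defaultdict

def inverted_index(records):
    """Map each token to the set of record ids whose token set contains it."""
    index = defaultdict(set)
    for rid, tokens in records.items():
        for tok in tokens:
            index[tok].add(rid)
    return index

def candidate_pairs(records1, records2):
    """Return all (id1, id2) pairs whose token sets share at least one token."""
    idx1 = inverted_index(records1)
    idx2 = inverted_index(records2)
    pairs = set()  # a set removes pairs that share more than one token
    for tok in idx1.keys() & idx2.keys():
        for id1 in idx1[tok]:
            for id2 in idx2[tok]:
                pairs.add((id1, id2))
    return pairs

# Hypothetical token sets, for illustration only.
records1 = {"a1": {"adobe", "photoshop"}, "a2": {"norton", "antivirus"}}
records2 = {"g1": {"adobe", "reader"}, "g2": {"quicken"}}
candidates = candidate_pairs(records1, records2)
```

In a DataFrame setting, the loop over shared tokens corresponds to joining the two inverted indexes on token and then removing duplicate id pairs.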

Task C. Computing Jaccard Similarity for Survived Pairs


In the second phase of the filtering-and-verification framework, you need to compute the Jaccard
similarity for each surviving pair and return those pairs whose Jaccard similarity is no
smaller than the specified threshold.
In Task C, your job is to implement the verification function. This task looks simple, but there are
a few small "traps" (see the hints below).

Hints.

You need to implement a function for computing the Jaccard similarity between two
joinKeys. Since the function will be called many times, think about the most
efficient way to implement it. You also need to handle some edge cases in the
function.

For testing purposes, you can compare your output with resultDF, which can be
found in the test folder of the Amazon-Google-Sample dataset.
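
A minimal sketch of the similarity function and the threshold check, in plain Python. The names jaccard and verify, the tuple layout of the candidates, and the choice to return 0.0 when both joinKeys are empty are assumptions for this example; the assignment does not say how that edge case should be resolved.

```python
def jaccard(key1, key2):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two token collections."""
    s1, s2 = set(key1), set(key2)  # sets ignore duplicate tokens
    if not s1 and not s2:
        return 0.0  # assumption: two empty joinKeys are treated as dissimilar
    inter = len(s1 & s2)
    # |A ∪ B| = |A| + |B| - |A ∩ B| avoids building the union set.
    union = len(s1) + len(s2) - inter
    return inter / union

def verify(candidates, threshold):
    """Keep the pairs whose similarity is no smaller than the threshold."""
    result = []
    for id1, key1, id2, key2 in candidates:
        sim = jaccard(key1, key2)
        if sim >= threshold:  # "no smaller than" means >=, not >
            result.append((id1, id2, sim))
    return result

# Hypothetical candidates, for illustration only.
candidates = [
    ("a1", ["a", "b", "c"], "g1", ["b", "c", "d"]),
    ("a2", ["x"], "g2", ["y", "z"]),
]
similar = verify(candidates, 0.5)
```

Computing the union size from the two set sizes and the intersection keeps each call to a single set operation, which helps when the function runs once per surviving pair.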

Task D. Evaluating an ER result

Hints. It is possible that |R|, |A|, or Precision+Recall is equal to zero, so please pay attention to these edge
cases.
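
The edge cases above can be handled as in the following plain-Python sketch. It assumes that R is the set of pairs returned by the ER algorithm and A is the set of truly matching pairs, uses the standard definitions of precision, recall, and F-measure, and returns 0.0 whenever a denominator is zero; that last choice and the sample pairs are assumptions for this example.

```python
def evaluate(result, ground_truth):
    """Return (precision, recall, fmeasure) for a list of matched pairs."""
    r, a = set(result), set(ground_truth)
    true_matches = len(r & a)
    # Guard each division: |R|, |A|, or precision + recall may be zero.
    precision = true_matches / len(r) if r else 0.0
    recall = true_matches / len(a) if a else 0.0
    denom = precision + recall
    fmeasure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, fmeasure

# Hypothetical data shaped like the sample run: 5 returned pairs, all correct,
# out of 16 true matches.
truth = [("a%d" % i, "g%d" % i) for i in range(16)]
found = truth[:5]
scores = evaluate(found, truth)
```

With 5 correct pairs out of 5 returned and 16 true matches, precision is 5/5 and recall is 5/16, which is consistent with the figures listed under Output.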
