Jaccard Similarity Join: The Code

Uploaded by: Prasanna Kumar

the attached code first, and then implement the remaining four functions: preprocessDF(), filtering(), verification(), and evaluate().

The Code:

entity_resolution.py

The Data set:

amazon-google-sample.zip

Output:
The program should output the following when running on the provided data:
Before filtering: 256 pairs in total
After Filtering: 79 pairs left
After Verification: 5 similar pairs
(precision, recall, fmeasure) = (1.0, 0.3125, 0.47619047619047616)

Jaccard Similarity Join

Task A. Data Preprocessing (Record --> Token Set)

Since Jaccard similarity takes two sets as input, your first job is to preprocess the DataFrames by
transforming each record into a set of tokens. Please implement the following function.

Hints.

If you mastered UDFs and withColumn in Assignment 3, you
should have no problem finishing this task. One small hint: take a look at
concat_ws.

For testing, you can compare your outputs with newDF1 and
newDF2, which are in the test folder of the Amazon-Google-Sample dataset.
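
A minimal plain-Python sketch of the record-to-token-set step. The function name record_to_tokens, the rule of lowercasing and splitting on non-word characters, and the sample record are assumptions made for illustration; the assignment's own tokenizer is not shown here. In Spark, logic like this could be wrapped in a UDF and applied with withColumn to a column built by concat_ws, which skips null values.

```python
import re


def record_to_tokens(values):
    """Turn the column values of one record into a de-duplicated token list.

    Assumption: tokens are lowercase runs of word characters.
    """
    # Join the non-null values with a space, similar to concat_ws(" ", ...).
    joined = " ".join(str(v) for v in values if v is not None)
    # Split on runs of non-word characters and drop empty strings.
    tokens = [t for t in re.split(r"\W+", joined.lower()) if t]
    # Remove duplicates while keeping first-seen order.
    return list(dict.fromkeys(tokens))


# Hypothetical record with a missing middle column.
print(record_to_tokens(["Adobe Photoshop", None, "CS3 (Mac)"]))
```

Keeping the result as a list of unique tokens (rather than a Python set) is a design choice here, since Spark array columns map naturally to lists.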

Task B. Filtering Obviously Non-matching Pairs

Hints.

You need to construct an inverted index for df1 and for df2. Each inverted index is a
DataFrame with two columns, token and id, which maps each token to the id of a record
that contains the token. You might need to use flatMap to obtain the inverted index.

For testing, you can compare your output with candDF, which is in
the test folder of the Amazon-Google-Sample dataset.
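
The inverted-index idea above can be sketched in plain Python as follows. The names inverted_index and candidate_pairs and the sample records are hypothetical; in Spark the same shape would come from flattening each record into (token, id) rows and joining the two indexes on token, which this sketch imitates with a dictionary.

```python
from collections import defaultdict


def inverted_index(records):
    """records: dict mapping record id -> token list.

    Returns (token, id) rows, one per distinct token of each record.
    """
    return [(tok, rid) for rid, toks in records.items() for tok in set(toks)]


def candidate_pairs(records1, records2):
    """Return the set of (id1, id2) pairs that share at least one token."""
    ids_by_token = defaultdict(set)
    for tok, rid in inverted_index(records2):
        ids_by_token[tok].add(rid)
    pairs = set()
    for tok, id1 in inverted_index(records1):
        # Every record in the second index with the same token is a candidate.
        for id2 in ids_by_token.get(tok, ()):
            pairs.add((id1, id2))
    return pairs


# Hypothetical sample data.
df1 = {"a1": ["adobe", "photoshop"], "a2": ["norton", "antivirus"]}
df2 = {"g1": ["photoshop", "elements"], "g2": ["quicken"]}
print(candidate_pairs(df1, df2))
```

Pairs with no token in common have a Jaccard similarity of zero, so dropping them cannot lose a match for any positive threshold.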

Task C. Computing Jaccard Similarity for Surviving Pairs

In the second phase of the filtering-and-verification framework, you need to compute the Jaccard
similarity for each surviving pair and return those pairs whose Jaccard similarity is no
smaller than the specified threshold.
In Task C, your job is to implement the verification function. This task looks simple, but there are
a few small "traps" (see the hints below).

Hints.

You need to implement a function that computes the Jaccard similarity between two
joinKeys. Since the function will be called many times, think about
the most efficient way to implement it. You also need
to handle some edge cases in the function.

For testing, you can compare your output with resultDF, which is
in the test folder of the Amazon-Google-Sample dataset.
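
One possible way to write the Jaccard function and the threshold check, as a plain-Python sketch. The names jaccard and verify are hypothetical, and returning 0.0 when both token sets are empty is an assumed convention, not something the assignment states. The union size is derived from the two set sizes and the intersection size, so no union set is built.

```python
def jaccard(tokens1, tokens2):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two token collections."""
    s1, s2 = set(tokens1), set(tokens2)
    if not s1 and not s2:
        # Assumed convention: two empty sets are treated as not similar.
        return 0.0
    inter = len(s1 & s2)
    # |A ∪ B| = |A| + |B| - |A ∩ B|
    return inter / (len(s1) + len(s2) - inter)


def verify(candidates, threshold):
    """candidates: iterable of (id1, tokens1, id2, tokens2) tuples.

    Keeps the pairs whose similarity is no smaller than the threshold.
    """
    result = []
    for id1, t1, id2, t2 in candidates:
        sim = jaccard(t1, t2)
        if sim >= threshold:
            result.append((id1, id2, sim))
    return result


print(jaccard(["a", "b", "c"], ["b", "c", "d"]))
```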

Task D. Evaluating an ER result

Hints. |R|, |A|, or Precision + Recall may well be zero, so pay attention to these edge
cases to avoid dividing by zero.
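
A hedged sketch of the evaluation step with the zero cases guarded. It assumes R is the set of pairs the program reports and A is the set of ground-truth matching pairs, with the usual definitions precision = |R ∩ A| / |R|, recall = |R ∩ A| / |A|, and fmeasure = 2 * precision * recall / (precision + recall); the function name and the sample data are hypothetical.

```python
def evaluate(result, ground_truth):
    """Return (precision, recall, fmeasure) for two collections of pairs."""
    r, a = set(result), set(ground_truth)
    correct = len(r & a)
    # Guard each division against an empty set or a zero sum.
    precision = correct / len(r) if r else 0.0
    recall = correct / len(a) if a else 0.0
    total = precision + recall
    fmeasure = 2 * precision * recall / total if total else 0.0
    return precision, recall, fmeasure


# Hypothetical data: 16 true matches, 5 reported pairs, all of them correct.
truth = [(i, i) for i in range(16)]
found = truth[:5]
print(evaluate(found, truth))
```

The sample sizes were chosen because 5 correct pairs out of 16 true matches give the precision and recall values in the expected output listed earlier; the figure of 16 is inferred from that recall value, not stated in the assignment.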
