
Presented by:
Kunal Jain (071309)
Under the guidance of
Mr. Praveen Kumar Tripathi
Dept. of CSE & IT (JUIT)
• Introduction
• Steps in Data Cleansing
• Conclusion
• References
“A company’s most important asset is information. A
corporation’s ability to compete, adapt, and grow in a
business climate of rapid change is dependent in large
measure on how well the company uses information
to make decisions. Sharing information that isn’t
clean and consolidated to the fullest extent can
substantially reduce the effectiveness of a system of
significant investment and considerable pay-off
potential.”
• Data cleansing, or data scrubbing, is the act of
detecting and correcting (or removing) corrupt or
inaccurate records from a record set, table, or
database. Used mainly in databases, the term
refers to identifying incomplete, incorrect,
inaccurate, or irrelevant parts of the data and
then replacing, modifying, or deleting this dirty data.
• Data cleansing can occur within a single set of records, or
between multiple sets of data which need to be merged or
which will work together.

• Typos and spelling errors are corrected, mislabeled data
is properly labeled and filed, and incomplete or missing
entries are completed.

• In more complex operations, data cleansing can be
performed by computer programs. These data cleansing
programs can check the data against a variety of rules and
procedures decided upon by the user.

• The goal of data cleansing is not just to clean up the data
in a database but also to bring consistency to different sets
of data that have been merged from separate databases.
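As a rough illustration of a program checking data against user-defined rules, the sketch below runs a record through a small rule table. The field names, the rules themselves, and the dummy phone value are assumptions chosen for illustration, not rules taken from any particular tool.

```python
import re

# Hypothetical user-defined rules: each maps a field to a validity check.
RULES = {
    "State": lambda v: bool(re.fullmatch(r"[A-Z]{2}", v)),  # two-letter code
    "Zip": lambda v: bool(re.fullmatch(r"\d{5}", v)),       # five digits
    "Phone": lambda v: v != "999-999-9999",                 # assumed dummy value
}

def find_violations(record, rules=RULES):
    """Return the names of fields present in the record that break a rule."""
    return [field for field, is_valid in rules.items()
            if field in record and not is_valid(record[field])]

record = {"State": "IL", "Zip": "6063", "Phone": "999-999-9999"}
print(find_violations(record))
```

Keeping the rules in a table rather than in the checking function lets the user add or change rules without touching the code that applies them.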
Dummy Values,
Absence of Data,
Multipurpose Fields,
Cryptic Data,
Contradicting Data,
Inappropriate Use of Address Lines,
Violation of Business Rules,
Reused Primary Keys,
Non-Unique Identifiers, and
Data Integration Problems
Parsing
Correcting
Standardizing
Matching
Consolidating
Parsing locates and identifies individual data
elements in the source files and then isolates
these data elements in the target files.
Input Data from Source File:
Beth Christine Parker, SLS MGR
Regional Port Authority
Federal Building
12800 Lake Calumet
Hedgewisch, IL

Parsed Data in Target File:
First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: Lake Calumet
City: Hedgewisch
State: IL
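A minimal sketch of the parsing step for this one record layout. It assumes the source record always arrives as five lines in the order shown and that the name has exactly three parts; the function name parse_record is hypothetical. Real parsers have to cope with far more variation.

```python
import re

def parse_record(lines):
    """Split a five-line source record into labeled fields."""
    # Line 1: "First Middle Last, TITLE"
    name_part, title = (s.strip() for s in lines[0].split(",", 1))
    first, middle, last = name_part.split()
    # Line 4: house number followed by the street name
    number, street = re.match(r"(\d+)\s+(.+)", lines[3].strip()).groups()
    # Line 5: "City, ST"
    city, state = (s.strip() for s in lines[4].split(",", 1))
    return {
        "First Name": first, "Middle Name": middle, "Last Name": last,
        "Title": title, "Firm": lines[1].strip(), "Location": lines[2].strip(),
        "Number": number, "Street": street, "City": city, "State": state,
    }

source = [
    "Beth Christine Parker, SLS MGR",
    "Regional Port Authority",
    "Federal Building",
    "12800 Lake Calumet",
    "Hedgewisch, IL",
]
parsed = parse_record(source)
for field, value in parsed.items():
    print(f"{field}: {value}")
```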
Correcting fixes the parsed individual data components
using sophisticated data algorithms and
secondary data sources.
Parsed Data:
First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: Lake Calumet
City: Hedgewisch
State: IL

Corrected Data:
First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: South Butler Drive
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398
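One simple way to picture the use of a secondary data source is a lookup against a reference table. In the sketch below the REFERENCE table and its key (firm, number, state) are assumptions made for illustration; a real correction step would more likely consult a postal or address database.

```python
# Hypothetical reference table standing in for a secondary data source.
REFERENCE = {
    ("Regional Port Authority", "12800", "IL"): {
        "Street": "South Butler Drive",
        "City": "Chicago",
        "Zip": "60633",
        "Zip+Four": "2398",
    },
}

def correct_record(parsed, reference=REFERENCE):
    """Overwrite or add fields using the reference entry, if one exists."""
    key = (parsed["Firm"], parsed["Number"], parsed["State"])
    corrected = dict(parsed)                 # leave the input untouched
    corrected.update(reference.get(key, {}))
    return corrected

parsed_data = {
    "First Name": "Beth", "Middle Name": "Christine", "Last Name": "Parker",
    "Title": "SLS MGR", "Firm": "Regional Port Authority",
    "Location": "Federal Building", "Number": "12800",
    "Street": "Lake Calumet", "City": "Hedgewisch", "State": "IL",
}
corrected_data = correct_record(parsed_data)
print(corrected_data["Street"], corrected_data["City"], corrected_data["Zip"])
```

Records with no entry in the reference table pass through unchanged, which keeps the step safe to run on data it cannot verify.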
Standardizing applies conversion routines to
transform data into its preferred (and
consistent) format using both standard and
custom business rules.
Corrected Data (before standardizing):
First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: South Butler Drive
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398

Corrected Data (after standardizing):
Pre-name: Ms.
First Name: Beth
1st Name Match Standards: Elizabeth, Bethany, Bethel
Middle Name: Christine
Last Name: Parker
Title: Sales Mgr.
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: S. Butler Dr.
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398
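The conversion routines can be sketched as lookup tables applied field by field. The rule tables below are assumptions that cover only the values in this example; a production rule set would be much larger and would include the custom business rules mentioned above.

```python
# Hypothetical conversion rules covering only the values in this example.
TITLE_RULES = {"SLS MGR": "Sales Mgr."}
STREET_RULES = {"South": "S.", "Drive": "Dr."}

def standardize(record):
    """Rewrite the title and street fields into their preferred forms."""
    out = dict(record)
    out["Title"] = TITLE_RULES.get(record["Title"], record["Title"])
    # Replace each street word that has a standard abbreviation.
    out["Street"] = " ".join(
        STREET_RULES.get(word, word) for word in record["Street"].split()
    )
    return out

standardized = standardize({"Title": "SLS MGR", "Street": "South Butler Drive"})
print(standardized["Title"])
print(standardized["Street"])
```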
Searching and matching records within and
across the parsed, corrected and standardized
data based on predefined business rules to
eliminate duplications.
Business  Street  Branch  Customer  City   Vendor  Pattern  Pattern
Name              Type    #/Tax ID         Code             I.D.
Exact     Exact   Exact   Exact     Exact  Exact   AAAAAA   P110
Exact     VClose  Exact   VClose    Exact  Blanks  ABAAA-   P115
Exact     VClose  Exact   Blanks    Exact  Exact   ABA-AA   P120
Exact     VClose  Close   Close     Exact  Exact   ABCCAA   S300
VClose    VClose  Exact   Close     Exact  Exact   BBACAA   S310
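The pattern column can be read as one letter per compared field: A for Exact, B for VClose, C for Close, and "-" for Blanks. The sketch below builds such a pattern for two records and looks up its pattern I.D. The similarity thresholds, the use of difflib, and the letter "X" for a non-match are assumptions; the table does not say how closeness is measured.

```python
from difflib import SequenceMatcher

# Pattern I.D.s as listed in the table above.
PATTERN_IDS = {
    "AAAAAA": "P110", "ABAAA-": "P115", "ABA-AA": "P120",
    "ABCCAA": "S300", "BBACAA": "S310",
}

def grade(a, b):
    """Grade one field comparison as A, B, C, '-' or X (assumed thresholds)."""
    if not a or not b:
        return "-"                      # Blanks
    if a == b:
        return "A"                      # Exact
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    if ratio >= 0.9:
        return "B"                      # VClose
    if ratio >= 0.75:
        return "C"                      # Close
    return "X"                          # no match (letter not in the table)

def match_pattern(rec1, rec2, fields):
    """Concatenate the per-field grades into a pattern string."""
    return "".join(grade(rec1.get(f, ""), rec2.get(f, "")) for f in fields)

fields = ["Business Name", "Street", "Branch Type",
          "Customer #/Tax ID", "City", "Vendor Code"]
rec = {f: "same value" for f in fields}
pattern = match_pattern(rec, rec, fields)
print(pattern, PATTERN_IDS.get(pattern))
```

A pattern that is not in the table returns None from the lookup, which a business rule could treat as "not a duplicate" or route to manual review.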


Corrected Data (Data Source #1):
Pre-name: Ms.
First Name: Beth
1st Name Match Standards: Elizabeth, Bethany, Bethel
Middle Name: Christine
Last Name: Parker
Title: Sales Mgr.
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: S. Butler Dr.
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398

Corrected Data (Data Source #2):
Pre-name: Ms.
First Name: Elizabeth
1st Name Match Standards: Beth, Bethany, Bethel
Middle Name: Christine
Last Name: Parker-Lewis
Title:
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: S. Butler Dr., Suite 2
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398
Phone: 708-555-1234
Fax: 708-555-5678
Analyzing and identifying relationships between
matched records and consolidating/merging
them into ONE representation.
Consolidated Data (merged from Data Source #1 and Data Source #2):
Name: Ms. Beth (Elizabeth) Christine Parker-Lewis
Title: Sales Mgr.
Firm: Regional Port Authority
Location: Federal Building
Address: 12800 S. Butler Dr., Suite 2
Chicago, IL 60633-2398
Phone: 708-555-1234
Fax: 708-555-5678
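One possible merging rule is to keep, for each field, the more complete of the two values. The sketch below uses "longest non-empty value wins" as that rule, which is an assumption and not necessarily the rule behind the consolidated record above: the slide keeps both first-name variants, whereas this rule would keep only "Elizabeth".

```python
def consolidate(rec1, rec2):
    """Merge two matched records, keeping the longer value for each field."""
    merged = {}
    # Walk rec1's fields first, then any fields that only rec2 has.
    for field in list(rec1) + [f for f in rec2 if f not in rec1]:
        a, b = rec1.get(field, ""), rec2.get(field, "")
        merged[field] = a if len(a) >= len(b) else b
    return merged

source1 = {"Last Name": "Parker", "Title": "Sales Mgr.",
           "Street": "S. Butler Dr."}
source2 = {"Last Name": "Parker-Lewis", "Title": "",
           "Street": "S. Butler Dr., Suite 2", "Phone": "708-555-1234"}
merged = consolidate(source1, source2)
for field, value in merged.items():
    print(f"{field}: {value}")
```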
1. Use metadata to document rules.

2. Determine a data cleansing schedule.

3. Build quality into new and existing systems.


Hence we conclude that data cleansing is
not only an effective tool for removing
unwanted, “dirty” data, but also a means of
making the data in our databases and systems
concise, selective, and appropriate, so that we
can serve our clients better and cater to their
demands.
Web:
• en.wikipedia.org/wiki/Data_cleansing
• www2.gbif.org/DataCleaning.pdf
• www.webopedia.com/TERM/D/data_cleansing.html
Books:
• Data Mining by Ian H. Witten and Eibe Frank
• Exploratory Data Mining and Data Quality by Dasu and Johnson (Wiley, 2004)
