jalali@mshdiua.ac.ir Jalali.mshdiau.ac.ir

Machine Learning

Data Preprocessing

Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary

Why Data Preprocessing?

Data in the real world is dirty:

Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  e.g., occupation="" (left blank)
Noisy: containing errors or outliers
  e.g., Salary=-10
Inconsistent: containing discrepancies in codes or names
  e.g., Age=42 but Birthday=03/07/1997
  e.g., ratings were 1, 2, 3 and are now A, B, C
  e.g., discrepancies between duplicate records

Why Is Data Dirty?

Incomplete data may come from:
  "Not applicable" data values at collection time
  Different considerations between the time the data was collected and the time it is analyzed
  Human, hardware, or software problems

Noisy data (incorrect values) may come from:
  Faulty data collection instruments
  Human or computer error at data entry
  Errors in data transmission

Inconsistent data may come from:
  Different data sources
  Functional dependency violations (e.g., modifying some linked data)

Duplicate records also need data cleaning.

Multi-Dimensional Measure of Data Quality

A well-accepted multidimensional view:
  Accuracy
  Completeness
  Consistency
  Timeliness
  Believability
  Value added
  Interpretability
  Accessibility

Major Tasks in Data Preprocessing

Data cleaning
  Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies

Data integration
  Integration of multiple databases, data cubes, or files

Data transformation
  Normalization and aggregation

Data reduction
  Obtains a representation reduced in volume that produces the same or similar analytical results

Data discretization
  Part of data reduction, of particular importance for numerical data

Data Preprocessing

Figure: the preprocessing pipeline, from data cleaning and data integration through data transformation (e.g., -2, 32, 100, 59, 48 mapped to -0.02, 0.32, 1.00, 0.59, 0.48) to data reduction.

Data Cleaning

Data cleaning tasks:
  Fill in missing values
  Identify outliers and smooth out noisy data
  Correct inconsistent data
  Resolve redundancy caused by data integration

Missing Data

Data is not always available
  e.g., many tuples have no recorded value for several attributes, such as customer income in sales data

Missing data may be due to:
  Equipment malfunction
  Data inconsistent with other recorded data, and thus deleted
  Data not entered due to misunderstanding
  Certain data not considered important at the time of entry
  Failure to register the history or changes of the data

Missing data may need to be inferred.

How to Handle Missing Data?

Fill in the missing value manually: tedious, and often infeasible
Fill it in automatically with:
  A global constant, e.g., "unknown" (beware: this may effectively form a new class!)
  The attribute mean
  The attribute mean for all samples belonging to the same class: smarter
  The most probable value: inference-based, e.g., using a Bayesian formula or a decision tree
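When the data sit in a table, the automatic options above are one-liners. A minimal sketch using pandas; the income and class columns are made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [38000, np.nan, 51000, np.nan, 47000, 62000],
    "class":  ["low", "low", "high", "high", "high", "high"],
})

# 1) Global constant: flag the value as unknown.
df["income_const"] = df["income"].fillna(-1)

# 2) Attribute mean over all samples.
df["income_mean"] = df["income"].fillna(df["income"].mean())

# 3) Attribute mean per class: smarter, since it uses the class label.
df["income_class_mean"] = df.groupby("class")["income"] \
                            .transform(lambda s: s.fillna(s.mean()))

print(df)
```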


Noisy Data

Noise: random error or variance in a measured variable

Incorrect attribute values may be due to:
  Faulty data collection instruments
  Data entry problems
  Data transmission problems
  Technology limitations
  Inconsistency in naming conventions

Other data problems requiring data cleaning:
  Duplicate records
  Incomplete data
  Inconsistent data

How to Handle Noisy Data?

Regression
  Smooth by fitting the data to regression functions

Clustering
  Detect and remove outliers

Binning
  First sort the data and partition it into (equal-frequency) bins, then smooth by bin means, bin medians, bin boundaries, etc.

Combined computer and human inspection
  Detect suspicious values automatically and have a human check them (e.g., to deal with possible outliers)

Regression

Fit the data to a function. Linear regression finds the best line to fit two variables; multiple regression can handle multiple variables. The values given by the fitted function are then used in place of the original values.

Figure: a scatter of points (x, y) with the fitted line y = x + 1.
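A minimal sketch of regression-based smoothing with plain NumPy; the (x, y) data are made up and roughly follow y = x + 1:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.3, 6.9])  # noisy observations

slope, intercept = np.polyfit(x, y, deg=1)    # least-squares fitted line
y_smoothed = slope * x + intercept            # fitted values replace the originals

print(slope, intercept)   # both close to 1
print(y_smoothed)
```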

Cluster Analysis

Similar values are organized into groups (clusters). Values falling outside of all clusters may be considered outliers and are candidates for elimination.

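A minimal sketch of clustering-based outlier detection, assuming scikit-learn is available; DBSCAN labels points that belong to no cluster with -1:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 2.0], [1.1, 2.1], [0.9, 1.9],   # cluster 1
              [8.0, 8.0], [8.2, 7.9], [7.9, 8.1],   # cluster 2
              [50.0, 50.0]])                         # isolated point

labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(X)
outliers = X[labels == -1]
print(outliers)   # [[50. 50.]] -- a candidate for elimination
```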

Binning

Equal-frequency (equal-depth) partitioning:
  Divides the range into N intervals, each containing approximately the same number of samples
  Gives good data scaling
  Managing categorical attributes can be tricky

Binning - Example

Original data for price (after sorting): 4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into equal-depth bins:
  Bin 1: 4, 8, 15
  Bin 2: 21, 21, 24
  Bin 3: 25, 28, 34

Smoothing by bin means (each value is replaced by the mean of its bin):
  Bin 1: 9, 9, 9
  Bin 2: 22, 22, 22
  Bin 3: 29, 29, 29

Smoothing by bin boundaries (the min and max of each bin are its boundaries; each value is replaced by the closest boundary):
  Bin 1: 4, 4, 15
  Bin 2: 21, 21, 24
  Bin 3: 25, 25, 34
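A minimal sketch of both smoothing variants, run on the price data above; the helper names are ours, not from any library:

```python
import numpy as np

def equal_depth_bins(values, n_bins):
    """Sort the values and split them into bins of (roughly) equal size."""
    return np.array_split(np.sort(np.asarray(values)), n_bins)

def smooth_by_means(bins):
    # Every value in a bin becomes the (rounded) bin mean.
    return [np.full(len(b), int(round(b.mean()))) for b in bins]

def smooth_by_boundaries(bins):
    # Every value becomes whichever bin boundary (min or max) is closer.
    return [np.where(b - b.min() <= b.max() - b, b.min(), b.max()) for b in bins]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_depth_bins(prices, 3)
print(smooth_by_means(bins))        # [9 9 9], [22 22 22], [29 29 29]
print(smooth_by_boundaries(bins))   # [4 4 15], [21 21 24], [25 25 34]
```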

Example

Binning the Temperature attribute of a weather data set. The original table:

ID  Outlook   Temperature  Humidity  Windy
1   sunny     85           85        FALSE
2   sunny     80           90        TRUE
3   overcast  83           78        FALSE
4   rain      70           96        FALSE
5   rain      68           80        FALSE
6   rain      65           70        TRUE
7   overcast  58           65        TRUE
8   sunny     72           95        FALSE
9   sunny     69           70        FALSE
10  rain      71           80        FALSE
11  sunny     75           70        TRUE
12  overcast  73           90        TRUE
13  overcast  81           75        FALSE
14  rain      75           80        TRUE

Sorted by Temperature and partitioned into five equal-depth bins:

ID  Temperature  Bin
7   58           Bin 1
6   65           Bin 1
5   68           Bin 1
9   69           Bin 2
4   70           Bin 2
10  71           Bin 2
8   72           Bin 3
12  73           Bin 3
11  75           Bin 3
14  75           Bin 4
2   80           Bin 4
13  81           Bin 4
3   83           Bin 5
1   85           Bin 5

Example (continued)

Each value is then replaced by the mean of its bin:

ID  Temperature  Smoothed  Bin
7   58           64        Bin 1
6   65           64        Bin 1
5   68           64        Bin 1
9   69           70        Bin 2
4   70           70        Bin 2
10  71           70        Bin 2
8   72           73        Bin 3
12  73           73        Bin 3
11  75           73        Bin 3
14  75           79        Bin 4
2   80           79        Bin 4
13  81           79        Bin 4
3   83           84        Bin 5
1   85           84        Bin 5

Smoothing Noisy Data - Example

The final table with the new values for the Temperature attribute:

ID  Outlook   Temperature  Humidity  Windy
1   sunny     84           85        FALSE
2   sunny     79           90        TRUE
3   overcast  84           78        FALSE
4   rain      70           96        FALSE
5   rain      64           80        FALSE
6   rain      64           70        TRUE
7   overcast  64           65        TRUE
8   sunny     73           95        FALSE
9   sunny     70           70        FALSE
10  rain      70           80        FALSE
11  sunny     73           70        TRUE
12  overcast  73           90        TRUE
13  overcast  79           75        FALSE
14  rain      79           80        TRUE

Data Integration

Data integration:
  Combines data from multiple sources into a coherent store

Schema integration:
  Integrate metadata from different sources, e.g., A.cust-id ≡ B.cust-#

Detecting and resolving data value conflicts:
  For the same real-world entity, attribute values from different sources may differ
  Possible reasons: different representations, different scales (e.g., metric vs. British units)
  Ontologies (e.g., WordNet) can be used to match the same entities across different databases

Handling Redundancy in Data Integration

Redundant data often arise when multiple databases are integrated:
  Object identification: the same attribute or object may have different names in different databases (semantic heterogeneity)
  Derivable data: one attribute may be derivable from attributes in another table, e.g., annual revenue

Redundant attributes may be detected by correlation analysis.

Careful integration of data from multiple sources may help reduce or avoid redundancies and inconsistencies, and improve mining speed and quality.

Correlation Analysis (Numerical Data)

Correlation coefficient (also called the Pearson product-moment correlation coefficient, PMCC):

$$ r_{A,B} \;=\; \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\,\sigma_A\,\sigma_B} \;=\; \frac{\sum (AB) \;-\; n\,\bar{A}\,\bar{B}}{(n-1)\,\sigma_A\,\sigma_B} $$

where $n$ is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B, and $\sum(AB)$ is the sum of the AB cross-products.

If $r_{A,B} > 0$, A and B are positively correlated (A's values increase as B's do); the larger the value, the stronger the correlation. $r_{A,B} = 0$: uncorrelated (no linear relationship). $r_{A,B} < 0$: negatively correlated.
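A minimal sketch of the computation on made-up data; NumPy's built-in corrcoef serves as a cross-check:

```python
import numpy as np

A = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
B = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

n = len(A)
# The formula above: sample standard deviations use the (n - 1) denominator.
r = ((A - A.mean()) * (B - B.mean())).sum() / ((n - 1) * A.std(ddof=1) * B.std(ddof=1))
print(r)                        # close to +1: strongly positively correlated
print(np.corrcoef(A, B)[0, 1])  # same value via NumPy's built-in
```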

Data Transformation

Smoothing: remove noise from the data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scale values to fall within a small, specified range
  Min-max normalization
  Z-score normalization
  Normalization by decimal scaling
Attribute/feature construction: new attributes constructed from the given ones

Data Transformation: Normalization

Min-max normalization to [new_min_A, new_max_A]:

$$ v' = \frac{v - \min_A}{\max_A - \min_A}\,(new\_max_A - new\_min_A) + new\_min_A $$

Ex. Let income range from \$12,000 to \$98,000, normalized to [0.0, 1.0]. Then \$73,600 is mapped to

$$ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716 $$

Z-score normalization ($\mu$: mean, $\sigma$: standard deviation):

$$ v' = \frac{v - \mu_A}{\sigma_A} $$

Ex. Let $\mu = 54{,}000$ and $\sigma = 16{,}000$. Then

$$ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 $$

Normalization by decimal scaling:

$$ v' = \frac{v}{10^{\,j}} $$

where $j$ is the smallest integer such that $\max(|v'|) < 1$.
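A minimal sketch of the three normalizations; the income figures reproduce the worked examples above:

```python
import numpy as np

def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
    # Map v from [lo, hi] onto [new_lo, new_hi].
    return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

def z_score(v, mean, std):
    return (v - mean) / std

def decimal_scaling(values):
    v = np.asarray(values, dtype=float)
    j = 0
    while np.abs(v / 10 ** j).max() >= 1:   # smallest j with max(|v'|) < 1
        j += 1
    return v / 10 ** j

print(min_max(73_600, 12_000, 98_000))         # 0.716...
print(z_score(73_600, 54_000, 16_000))         # 1.225
print(decimal_scaling([-2, 32, 100, 59, 48]))  # [-0.002  0.032  0.1  0.059  0.048]
```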


Normalization: Example - z-score normalization

Example: normalizing the Humidity attribute (mean = 80.3, stdev = 9.84):

Humidity  z-score
85         0.48
90         0.99
78        -0.23
96         1.60
80        -0.03
70        -1.05
65        -1.55
95         1.49
70        -1.05
80        -0.03
70        -1.05
90         0.99
75        -0.54
80        -0.03

Normalization: Example II - Min-max normalization

Original table:

ID  Gender  Age  Salary
1   F       27   19,000
2   M       51   64,000
3   M       52   100,000
4   F       33   55,000
5   M       45   45,000

After min-max normalization to [0, 1] (Gender encoded as F = 1, M = 0):

ID  Gender  Age   Salary
1   1       0.00  0.00
2   0       0.96  0.56
3   0       1.00  1.00
4   1       0.24  0.44
5   0       0.72  0.32

Data Reduction Strategies

Why data reduction?
  A database/data warehouse may store terabytes of data
  Complex data analysis/mining may take a very long time to run on the complete data set

Data reduction:
  Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results

Data reduction strategies:
  Data cube aggregation
  Dimensionality reduction, e.g., removing unimportant attributes
  Data compression
  Numerosity reduction, e.g., fitting the data to models (regression, clustering)
  Discretization and concept hierarchy generation

Discretization

Three types of attributes:
  Nominal: values from an unordered set, e.g., color, profession
  Ordinal: values from an ordered set, e.g., military or academic rank
  Continuous: numeric values, e.g., integer or real numbers

Discretization:
  Divide the range of a continuous attribute into intervals
  Some classification algorithms only accept categorical attributes
  Reduces data size
  Prepares the data for further analysis

Discretization and Concept Hierarchy

Discretization:
  Reduce the number of values of a given continuous attribute by dividing its range into intervals
  Interval labels can then be used to replace the actual data values
  Supervised vs. unsupervised
  Split (top-down) vs. merge (bottom-up)
  Discretization can be performed recursively on an attribute

Concept hierarchy formation:
  Recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) with higher-level concepts (such as young, middle-aged, or senior)

Discretization - Example

Example: discretizing the Humidity attribute using 3 bins (Low = 60-69, Normal = 70-79, High = 80+):

Humidity  Label
85        High
90        High
78        Normal
96        High
80        High
70        Normal
65        Low
95        High
70        Normal
80        High
70        Normal
90        High
75        Normal
80        High
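A minimal sketch of this discretization with pandas, assuming the bin edges above (below 70 is Low, 70-79 is Normal, 80 and above is High):

```python
import pandas as pd

humidity = pd.Series([85, 90, 78, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 80])

# pd.cut uses half-open intervals (lo, hi]; edges 0/69/79/200 give the
# three ranges from the slide.
labels = pd.cut(humidity, bins=[0, 69, 79, 200], labels=["Low", "Normal", "High"])
print(labels.tolist())   # ['High', 'High', 'Normal', 'High', 'High', 'Normal', 'Low', ...]
```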


Concept Hierarchy Generation for Categorical Data

Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts:
  street < city < state < country

Specification of a hierarchy for a set of values by explicit data grouping:
  {Urbana, Champaign, Chicago} < Illinois

Specification of only a partial set of attributes:
  e.g., only street < city, not the others

Automatic generation of hierarchies (or attribute levels) by analysis of the number of distinct values:
  e.g., for a set of attributes {street, city, state, country}

Automatic Concept Hierarchy Generation

Some hierarchies can be automatically generated based on analysis of the number of distinct values per attribute in the data set:
  The attribute with the most distinct values is placed at the lowest level of the hierarchy
  Exceptions exist, e.g., weekday, month, quarter, year

country            15 distinct values
province_or_state  365 distinct values
city               3,567 distinct values
street             674,339 distinct values
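A minimal sketch of the heuristic: order the attributes by their distinct-value counts, fewest at the top of the hierarchy. The counts are taken from this slide:

```python
# Distinct-value counts per attribute (from the slide above).
distinct_counts = {
    "country": 15,
    "province_or_state": 365,
    "city": 3_567,
    "street": 674_339,
}

# Fewest distinct values = highest level; print lowest level first.
hierarchy = sorted(distinct_counts, key=distinct_counts.get)
print(" < ".join(reversed(hierarchy)))  # street < city < province_or_state < country
```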

Exercise: preprocess the IRIS dataset using Weka's Pre-Processing tools (related topic: spatio-temporal data mining).

Machine Learning
