
Data Preprocessing

 In this section we address the issue of which preprocessing steps should be applied to make the data more suitable for data mining.
 Data preprocessing is a broad area and consists of a number of different strategies and techniques that are interrelated in complex ways.
 The following topics present the most important ideas and approaches and point out the interrelationships among them:
 Aggregation
 Sampling
 Dimensionality Reduction
 Feature subset selection
 Feature creation
 Discretization and Binarization
 Attribute Transformation
These items fall into two categories:
1. Selecting data objects and attributes for the analysis, or
2. Creating/changing the attributes.
Aggregation
 Sometimes "less is more", and this is the case with aggregation: the combining of two or more objects (or attributes) into a single object.
 Aggregation can also be viewed as eliminating attributes, such as the type of item, or reducing the number of values for a particular attribute.
Ex: reducing the possible values for date from 365 days to 12 months.
 Consider a data set consisting of transactions (data objects) recording the daily sales of products in various store locations (Minneapolis, Chicago, Paris, ...) for different days over the course of a year. One way to aggregate is to replace all the transactions of a single store in a given month with a single monthly total.
 Purpose of aggregation
◦ Data reduction
 Reduce the number of attributes or objects
◦ Change of scale
 Cities aggregated into regions, states, countries, etc.
◦ More “stable” data
 Aggregated data tends to have less variability

 Note: A disadvantage of aggregation is the potential loss of interesting details. In the store example, aggregating over months loses information about which day of the week has the highest sales.
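
As an illustration, here is a minimal pandas sketch of the store example; the DataFrame and its column names (date, store_location, sales) are hypothetical, not from the original text.

```python
# Minimal sketch: aggregating daily sales transactions into monthly totals per store.
import pandas as pd

# Hypothetical daily transactions (one row per transaction).
sales = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-05", "2023-01-20", "2023-02-03", "2023-02-17"]),
    "store_location": ["Minneapolis", "Minneapolis", "Chicago", "Chicago"],
    "sales": [120.0, 95.0, 210.0, 180.0],
})

# Reduce the 365 possible date values to 12 months and sum sales per store and month.
monthly = (sales
           .assign(month=sales["date"].dt.to_period("M"))
           .groupby(["store_location", "month"], as_index=False)["sales"]
           .sum())
print(monthly)
```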
Sampling
 Sampling is the main technique employed for data
selection. Sampling is a commonly used approach for
selecting a subset of the data objects to be analyzed.
◦ It is often used for both the preliminary
investigation of the data and the final data analysis.
◦ Statisticians sample because obtaining the entire set
of data of interest is too expensive or time
consuming.
◦ Sampling is used in data mining because processing
the entire set of data of interest is too expensive or
time consuming.
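
A minimal sketch of simple random sampling with pandas; the DataFrame `data` and its contents are made up for illustration.

```python
# Minimal sketch: selecting a subset of data objects by simple random sampling.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = pd.DataFrame({"x": np.arange(1000), "y": rng.normal(size=1000)})

# Sample 10% of the objects without replacement.
sample_without = data.sample(frac=0.10, replace=False, random_state=0)

# Sample 100 objects with replacement (the same object may be picked more than once).
sample_with = data.sample(n=100, replace=True, random_state=0)

print(len(sample_without), len(sample_with))  # 100 100
```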
Dimensionality Reduction
 Purpose:
◦ Avoid curse of dimensionality
◦ Reduce amount of time and memory required by data
mining algorithms
◦ Allow data to be more easily visualized
◦ May help to eliminate irrelevant features or reduce
noise

 Techniques
◦ Principal Component Analysis (PCA)
◦ Singular Value Decomposition
◦ Others: supervised and non-linear techniques
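
A minimal sketch of the first technique using scikit-learn's PCA; the synthetic data and the choice of two components are assumptions for illustration.

```python
# Minimal sketch: reducing 10 attributes to 2 principal components with PCA.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))        # 200 data objects, 10 attributes

pca = PCA(n_components=2)             # keep only the first two principal components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (200, 2)
print(pca.explained_variance_ratio_)  # fraction of variance captured by each component
```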
Feature Subset Selection
 Another way to reduce dimensionality of data
 Redundant features
◦ duplicate much or all of the information contained in
one or more other attributes
◦ Example: purchase price of a product and the
amount of sales tax paid
 Irrelevant features
◦ contain no information that is useful for the data
mining task at hand
◦ Example: students' ID is often irrelevant to the task
of predicting students' GPA
Feature Subset Selection
 Techniques:
◦ Brute-force approach:
 Try all possible feature subsets as input to the data mining algorithm
◦ Embedded approaches:
 Feature selection occurs naturally as part of the data mining algorithm. Algorithms for building decision tree classifiers often operate in this manner (see the sketch below).
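
A minimal sketch of the embedded approach, assuming scikit-learn is available; the synthetic data, in which only the first two features carry signal, is made up for illustration.

```python
# Minimal sketch: a decision tree selects features as a side effect of being built;
# irrelevant features end up with (near-)zero importance.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))              # 5 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # only features 0 and 1 determine the class

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(tree.feature_importances_)           # importances near 0 for features 2, 3, 4
```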
Feature Creation
 Create new attributes that can capture the important
information in a data set much more efficiently than the
original attributes
 Three general methodologies:
◦ Feature Extraction: The creation of new set of
features from the original raw data is known as
feature extraction.
◦ Mapping Data to a New Space: if a number of periodic patterns are present together with a significant amount of noise, these patterns can be hard to detect in the original space. Such patterns are often detected by applying a Fourier transform, which maps the data to a frequency space (see the sketch below).
◦ Feature Construction: combining the original features into new, more meaningful ones (for example, constructing a density feature from mass and volume).
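
A minimal sketch of the Fourier-transform idea on made-up data: two periodic patterns buried in noise are hard to see in the time domain but appear as clear peaks in the frequency domain.

```python
# Minimal sketch: mapping a noisy time series to frequency space with a Fourier transform.
import numpy as np

t = np.linspace(0.0, 1.0, 512, endpoint=False)
signal = np.sin(2 * np.pi * 7 * t) + np.sin(2 * np.pi * 17 * t)   # two periodic patterns
noisy = signal + np.random.default_rng(0).normal(scale=1.0, size=t.size)

spectrum = np.abs(np.fft.rfft(noisy))
freqs = np.fft.rfftfreq(t.size, d=t[1] - t[0])

# The two largest peaks (ignoring the DC component) sit near 7 Hz and 17 Hz.
top_two = freqs[np.argsort(spectrum[1:])[-2:] + 1]
print(sorted(top_two))
```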
Discretization and Binarization
 It is often necessary to transform a continuous attribute into a categorical attribute; this is called discretization. Both continuous and discrete attributes may need to be transformed into one or more binary attributes; this is called binarization.
Binarization

 A simple technique to binarize a categorical attribute is the following: if there are m categorical values, then uniquely assign each original value to an integer in the interval [0, m-1].
 Next, convert each of these m integers to a binary number; for m = 5 values this requires only three binary digits.
 However, association analysis requires asymmetric binary attributes, where only the presence of a value (a 1) is important. In that case one binary attribute is introduced per categorical value, so a categorical attribute with 5 values {awful, poor, ok, good, great} requires five binary variables x1, x2, x3, x4, x5, as the table below illustrates.

Categorical value   Integer value   x1   x2   x3   x4   x5
awful               0               1    0    0    0    0
poor                1               0    1    0    0    0
ok                  2               0    0    1    0    0
good                3               0    0    0    1    0
great               4               0    0    0    0    1
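
A minimal pandas sketch of this binarization; the category order is fixed explicitly so the integer codes and the one-attribute-per-value columns match the table above.

```python
# Minimal sketch: categorical values -> integer codes -> one binary attribute per value.
import pandas as pd

order = ["awful", "poor", "ok", "good", "great"]
ratings = pd.Series(order, dtype=pd.CategoricalDtype(categories=order, ordered=True))

integer_codes = ratings.cat.codes             # awful=0, poor=1, ok=2, good=3, great=4
binary = pd.get_dummies(ratings).astype(int)  # x1..x5: asymmetric binary attributes

print(pd.concat([ratings.rename("value"),
                 integer_codes.rename("code"),
                 binary], axis=1))
```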
Measures of Similarity and Dissimilarity
 Similarity
◦ Numerical measure of how alike two data objects are.
◦ Is higher when objects are more alike.
◦ Often falls in the range [0,1]
 Dissimilarity
◦ Numerical measure of how different two data objects are
◦ Lower when objects are more alike
◦ Minimum dissimilarity is often 0
◦ Upper limit varies
 Proximity refers to a similarity or dissimilarity
Similarity/Dissimilarity for Simple
Attributes
p and q are the attribute values for two data objects.
Dissimilarities between data objects: Euclidean Distance
 Euclidean Distance

dist(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}

where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.

 Standardization is necessary if scales differ.
Euclidean Distance

Point coordinates (the original figure plots these four points in the plane):

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

Euclidean distance matrix:

        p1      p2      p3      p4
p1      0       2.828   3.162   5.099
p2      2.828   0       1.414   3.162
p3      3.162   1.414   0       2
p4      5.099   3.162   2       0
Minkowski Distance
 Minkowski Distance is a generalization of Euclidean Distance:

dist(p, q) = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r}

where r is a parameter, n is the number of dimensions (attributes), and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
Minkowski Distance: Examples
 r = 1. City block (Manhattan, taxicab, L1 norm)
distance.
◦ A common example of this is the Hamming
distance, which is just the number of bits that are
different between two binary vectors
 r = 2. Euclidean distance.
 r → ∞. "Supremum" (Lmax norm, L∞ norm) distance.
◦ This is the maximum difference between any component of the vectors.
 Do not confuse r with n, i.e., all these distances are
defined for all numbers of dimensions.
Minkowski Distance
point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

L1      p1   p2   p3   p4
p1      0    4    4    6
p2      4    0    2    4
p3      4    2    0    2
p4      6    4    2    0

L2      p1      p2      p3      p4
p1      0       2.828   3.162   5.099
p2      2.828   0       1.414   3.162
p3      3.162   1.414   0       2
p4      5.099   3.162   2       0

L∞      p1   p2   p3   p4
p1      0    2    3    5
p2      2    0    1    3
p3      3    1    0    2
p4      5    3    2    0

Distance matrices for r = 1, r = 2, and r → ∞
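
A minimal sketch, assuming scipy is available, that reproduces the three distance matrices above from the point coordinates.

```python
# Minimal sketch: L1, L2 and L-infinity distance matrices for the four points above.
import numpy as np
from scipy.spatial.distance import cdist

points = np.array([[0, 2],   # p1
                   [2, 0],   # p2
                   [3, 1],   # p3
                   [5, 1]])  # p4

print(cdist(points, points, metric="cityblock"))   # r = 1 (city block)
print(cdist(points, points, metric="euclidean"))   # r = 2 (Euclidean)
print(cdist(points, points, metric="chebyshev"))   # r -> infinity (supremum)
```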
Mahalanobis Distance:
Suppose Euclidean distance is used to measure the distance between people based on two attributes: age and income. The Mahalanobis distance is useful when attributes are correlated, have different ranges of values, and the distribution of the data is approximately normal.

mahalanobis(p, q) = (p - q)\, \Sigma^{-1} (p - q)^T

where \Sigma is the covariance matrix of the input data X:

\Sigma_{j,k} = \frac{1}{n-1} \sum_{i=1}^{n} (X_{ij} - \bar{X}_j)(X_{ik} - \bar{X}_k)

(In the original figure, for the red points the Euclidean distance is 14.7 while the Mahalanobis distance is 6.)


Mahalanobis Distance
Covariance Matrix: (given in the original figure)

A: (0.5, 0.5)
B: (0, 1)
C: (1.5, 1.5)

Mahal(A, B) = 5
Mahal(A, C) = 4
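
A minimal numpy sketch of the computation. The covariance matrix appears only in the original figure, so the value used below is an assumption, chosen to be consistent with the stated results Mahal(A,B) = 5 and Mahal(A,C) = 4.

```python
# Minimal sketch of the Mahalanobis distance as defined above (no square root).
import numpy as np

sigma = np.array([[0.3, 0.2],
                  [0.2, 0.3]])          # assumed covariance matrix (not given in the text)
sigma_inv = np.linalg.inv(sigma)

def mahalanobis(p, q):
    d = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)
    return float(d @ sigma_inv @ d)     # (p - q) Sigma^-1 (p - q)^T

A, B, C = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)
print(mahalanobis(A, B))   # ~5.0 (up to floating-point rounding)
print(mahalanobis(A, C))   # ~4.0
```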
Common Properties of a Distance
 Distances, such as the Euclidean distance, have some well-known properties.

1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q. (Positive definiteness)
2. d(p, q) = d(q, p) for all p and q. (Symmetry)
3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle Inequality)

where d(p, q) is the distance (dissimilarity) between points (data objects) p and q.

 A distance that satisfies these properties is a metric.


Common Properties of a Similarity
 Similarities also have some well-known properties.

1. s(p, q) = 1 (or maximum similarity) only if p = q.
2. s(p, q) = s(q, p) for all p and q. (Symmetry)

where s(p, q) is the similarity between points (data objects) p and q.
Similarity Between Binary Vectors
 Similarity measures between objects that contain
only binary attributes are called similarity
coefficients, and typically have values between 0 and
1.

 A common situation is that objects p and q have only binary attributes.
 Compute similarities using the following quantities
M01 = the number of attributes where p was 0
and q was 1
M10 = the number of attributes where p was 1
and q was 0
M00 = the number of attributes where p was 0
and q was 0
M11 = the number of attributes where p was 1
and q was 1

 Simple Matching Coefficient:
SMC = number of matches / number of attributes
    = (M11 + M00) / (M01 + M10 + M11 + M00)
Jaccard Coefficient

 Suppose that x and y are data objects that represent two rows of a transaction matrix. If each asymmetric binary attribute corresponds to an item in a store, then a 1 indicates that the item was purchased and a 0 indicates that it was not purchased.
 The Jaccard coefficient is frequently used to handle objects consisting of asymmetric binary attributes.

J = number of 11 matches / number of not-both-zero attribute values
  = M11 / (M01 + M10 + M11)
SMC versus Jaccard: Example
p= 1000000000
q= 0000001001

M01 = 2 (the number of attributes where p was 0 and q was 1)


M10 = 1 (the number of attributes where p was 1 and q was 0)
M00 = 7 (the number of attributes where p was 0 and q was 0)
M11 = 0 (the number of attributes where p was 1 and q was 1)

SMC = (M11 + M00)/(M01 + M10 + M11 + M00) = (0+7) / (2+1+0+7) = 0.7

J = (M11) / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
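
A minimal numpy sketch that reproduces the SMC and Jaccard values computed above.

```python
# Minimal sketch: SMC and Jaccard coefficient for the binary vectors p and q above.
import numpy as np

p = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
q = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

m11 = int(np.sum((p == 1) & (q == 1)))
m10 = int(np.sum((p == 1) & (q == 0)))
m01 = int(np.sum((p == 0) & (q == 1)))
m00 = int(np.sum((p == 0) & (q == 0)))

smc = (m11 + m00) / (m01 + m10 + m11 + m00)
jaccard = m11 / (m01 + m10 + m11) if (m01 + m10 + m11) > 0 else 0.0
print(smc, jaccard)   # 0.7 0.0
```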


Cosine Similarity
 Cosine similarity is a measure of the angle between x and y. Thus, if the cosine similarity is 1, the angle between x and y is 0 degrees, and x and y are the same except for magnitude.
 If the cosine similarity is 0, then the angle between x and y is 90 degrees and they do not share any terms (words).
 If d1 and d2 are two document vectors, then
cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
where · indicates the vector dot product and ||d|| is the length of vector d.
Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
d1 · d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449
cos(d1, d2) = 5 / (6.481 * 2.449) = 0.315
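
A minimal numpy sketch that reproduces the cosine similarity computed above.

```python
# Minimal sketch: cosine similarity between the two document vectors d1 and d2.
import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])

cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 4))   # ~0.315
```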
Extended Jaccard Coefficient
(Tanimoto)
 Variation of Jaccard for continuous or count attributes
◦ Reduces to Jaccard for binary attributes
Correlation
 Correlation measures the linear relationship between objects.
 To compute correlation, we standardize the data objects p and q and then take their dot product:

p'_k = (p_k - \mathrm{mean}(p)) / \mathrm{std}(p)
q'_k = (q_k - \mathrm{mean}(q)) / \mathrm{std}(q)

correlation(p, q) = p' \cdot q'
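
A minimal numpy sketch of this computation on made-up data. Note that the dot product of the standardized objects is divided by n here so the result matches the usual Pearson correlation coefficient; whether the factor is 1/n or 1/(n-1) depends on the std convention used.

```python
# Minimal sketch: correlation as the (scaled) dot product of standardized objects.
import numpy as np

p = np.array([3.0, 6.0, 0.0, 3.0, 6.0])
q = np.array([1.0, 2.0, 0.0, 1.0, 2.0])   # q is a linear function of p

p_std = (p - p.mean()) / p.std()           # population std (ddof=0)
q_std = (q - q.mean()) / q.std()

correlation = (p_std @ q_std) / len(p)     # equals np.corrcoef(p, q)[0, 1]
print(correlation)                         # 1.0 for a perfect linear relationship
```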
Exploring Data: Preliminary investigation of the data in order to better understand its specific characteristics.
What is data exploration?
A preliminary exploration of the data to
better understand its characteristics.
 Key motivations of data exploration include
◦ Helping to select the right tool for preprocessing or analysis
◦ Making use of humans’ abilities to recognize patterns
 People can recognize patterns not captured by
data analysis tools
 Related to the area of Exploratory Data Analysis (EDA)
◦ Created by statistician John Tukey
◦ Seminal book is Exploratory Data Analysis by Tukey
◦ A nice online introduction can be found in Chapter 1 of the
NIST Engineering Statistics Handbook
Techniques Used In Data Exploration
 In EDA, as originally defined by Tukey
◦ The focus was on visualization
◦ Clustering and anomaly detection were viewed as
exploratory techniques
◦ In data mining, clustering and anomaly detection are
major areas of interest, and not thought of as just
exploratory
 In our discussion of data exploration, we focus on
◦ Summary statistics (mean, SD, range, variance, etc.)
◦ Visualization (histograms and scatter plots)
◦ Online Analytical Processing (OLAP) for exploring multidimensional arrays of values
Iris Sample Data Set
 Many of the exploratory data techniques are illustrated with the Iris Plant data set.
◦ Can be obtained from the UCI Machine Learning Repository
◦ From the statistician Douglas Fisher
◦ Three flower types (classes):
 Setosa
 Virginica
 Versicolour
◦ Four (non-class) attributes:
 Sepal width and length
 Petal width and length

(Image credit: Virginica. Robert H. Mohlenbrock. USDA NRCS. 1995. Northeast wetland flora: Field office guide to plant species. Northeast National Technical Center, Chester, PA. Courtesy of USDA NRCS Wetland Science Institute.)
Summary Statistics
◦ Summary statistics are numbers that summarize properties of the data.
◦ Summarized properties include frequency, location, and spread.
 Examples: location - mean; spread - standard deviation
◦ Most summary statistics can be calculated in a single pass through the data.
Frequency and Mode
 The frequency of an attribute value is the percentage
of time the value occurs in the
data set
◦ For example, given the attribute ‘gender’ and a
representative population of people, the gender
‘female’ occurs about 50% of the time.
 The mode of an attribute is the most frequent
attribute value
 The notions of frequency and mode are typically
used with categorical data
Percentiles
 For continuous data, the notion of a percentile is more
useful.

 Given an ordinal or continuous attribute x and a number p between 0 and 100, the pth percentile x_p is a value of x such that p% of the observed values of x are less than x_p.
 For instance, the 50th percentile is the value x_{50%} such that 50% of all values of x are less than x_{50%}.
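
A minimal numpy sketch of percentiles on made-up data.

```python
# Minimal sketch: the 50th percentile (median) and the 25th percentile.
import numpy as np

x = np.array([2, 4, 4, 5, 7, 9, 10, 12, 15, 20])

print(np.percentile(x, 50))   # 8.0  (half of the observed values lie below it)
print(np.percentile(x, 25))   # 4.25 (with numpy's default linear interpolation)
```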



Measures of Location: Mean and
Median
 The mean is the most common measure of the location
of a set of points.
 However, the mean is very sensitive to outliers.
 Thus, the median or a trimmed mean is also commonly
used.
Measures of Spread: Range and
Variance

 Range is the difference between the max and min.
 The variance or standard deviation is the most common measure of the spread of a set of points.
 However, these are also sensitive to outliers, so other measures are often used: the absolute average deviation (AAD), the median absolute deviation (MAD), and the interquartile range (IQR), as sketched below.
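
A minimal sketch of these spread measures on made-up data containing an outlier, assuming numpy and scipy are available.

```python
# Minimal sketch: range, variance/std, AAD, MAD and IQR for a small data set.
import numpy as np
from scipy import stats

x = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 10.0, 12.0, 15.0, 100.0])  # 100 is an outlier

print(x.max() - x.min())                            # range
print(x.var(ddof=1), x.std(ddof=1))                 # sample variance and standard deviation
print(np.mean(np.abs(x - x.mean())))                # absolute average deviation (AAD)
print(stats.median_abs_deviation(x))                # median absolute deviation (MAD)
print(np.percentile(x, 75) - np.percentile(x, 25))  # interquartile range (IQR)
```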
