Data Preprocessing
Data cleaning
Data reduction
Summary
e.g., occupation="" (missing value)
e.g., Salary="-10" (noisy value)
human/hardware/software problems during data collection, entry, or transmission
Data cleaning
Data integration
Data reduction
Data transformation
Data discretization
Forms of data preprocessing
Data Preprocessing
Data cleaning
Data reduction
Summary
Data Cleaning
Importance
"Data cleaning is one of the three biggest problems in data warehousing" (Ralph Kimball)
"Data cleaning is the number one problem in data warehousing" (DCI survey)
Missing Data
equipment malfunction
Noisy Data
Binning method:
first sort data and partition into (equi-depth) bins
Binning methods smooth a sorted data value by consulting its neighborhood, that is, the values around it. The sorted values are distributed into a number of buckets, or bins. Because binning methods consult the neighborhood of values, they perform local smoothing.
Equal-width (distance) partitioning:
Divides the range into N intervals of equal size: uniform grid
if A and B are the lowest and highest values of the attribute, the width of intervals will be W = (B - A)/N
The most straightforward, but outliers may dominate the presentation
Skewed data is not handled well
Equal-depth (frequency) partitioning:
Divides the range into N intervals, each containing approximately the same number of samples
Good data scaling
Managing categorical attributes can be tricky
Data Mining: Concepts and Techniques (April 27, 2016)
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25,
26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
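The binning example above can be sketched in a few lines of stdlib Python; the function names are illustrative, not from the source:

```python
# Sketch of equi-depth binning with smoothing by bin means and bin boundaries.
def equi_depth_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of equal size."""
    s = sorted(values)
    size = len(s) // n_bins
    return [s[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min/max boundary."""
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(prices, 3)
```

Running this on the sorted price data reproduces the three bins and both smoothed versions shown above.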
Cluster Analysis
Regression
[Figure: data points, e.g., (X1, Y1), fitted by the regression line y = x + 1]
Data cleaning
Data reduction
Summary
Data Integration
Data Integration
Data integration:
combines data from multiple sources into a
coherent store
Schema integration
integrate metadata from different sources
Entity identification problem: identify real world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
Detecting and resolving data value conflicts
for the same real world entity, attribute values
from different sources are different
possible reasons: different representations,
different scales, e.g., metric vs. British units
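A minimal sketch of these two integration issues, using invented record layouts (the key names follow the A.cust-id ≡ B.cust-# example above; the sources, fields, and conversion factor are illustrative assumptions):

```python
# Source A keys customers as "cust-id" and stores weight in metric kilograms;
# source B keys the same entities as "cust-#" and stores weight in British pounds.
source_a = [{"cust-id": 1, "name": "Smith", "weight_kg": 70.0}]
source_b = [{"cust-#": 1, "city": "London", "weight_lb": 154.3}]

LB_TO_KG = 0.4536  # approximate conversion factor, for illustration

def integrate(a_rows, b_rows):
    """Match entities on the identified key pair and resolve the unit conflict."""
    b_by_key = {row["cust-#"]: row for row in b_rows}
    merged = []
    for row in a_rows:
        match = b_by_key.get(row["cust-id"], {})
        rec = dict(row)
        rec["city"] = match.get("city")
        # Conflict detection: convert B's pounds to kilograms so the two
        # weight values become directly comparable.
        if "weight_lb" in match:
            rec["weight_kg_from_b"] = round(match["weight_lb"] * LB_TO_KG, 1)
        merged.append(rec)
    return merged
```

After conversion, the two sources agree on the customer's weight, so the conflict is a representation issue rather than a data error.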
Correlation Analysis
Correlation analysis measures the relationship between two items, for example, a security's price and an indicator.
r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n\bar{A}\bar{B}}{(n-1)\sigma_A \sigma_B}

where n is the number of tuples, \bar{A} and \bar{B} are the respective means of A and B, \sigma_A and \sigma_B are the respective standard deviations of A and B, and \sum a_i b_i is the sum of the AB cross-product.
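The formula above translates directly into stdlib Python (the function name is illustrative):

```python
# Correlation coefficient computed term by term from the formula above.
from math import sqrt

def corr(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    # Sample standard deviations with the (n - 1) denominator, as in the formula.
    sd_a = sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    sd_b = sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    cross = sum(x * y for x, y in zip(a, b))  # sum of the AB cross-product
    return (cross - n * mean_a * mean_b) / ((n - 1) * sd_a * sd_b)
```

Perfectly linear attributes give +1 (rising together) or -1 (one falls as the other rises).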
χ² (chi-square) test

\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}

The larger the χ² value, the more likely the variables are related
                          | Play chess | Not play chess | Sum (row)
Like science fiction      |   250 (90) |      200 (360) |       450
Not like science fiction  |   50 (210) |     1000 (840) |      1050
Sum (col.)                |        300 |           1200 |      1500

(Expected counts, shown in parentheses, are computed from the marginals.)

\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93
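The χ² value above can be recomputed from the observed counts alone, since the expected counts follow from the row and column totals (function name is illustrative):

```python
# Chi-square statistic for a contingency table of observed counts.
def chi_square(observed):
    """observed: 2-D list of counts; expected counts come from the marginals."""
    total = sum(sum(row) for row in observed)
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_sums[i] * col_sums[j] / total  # expected under independence
            chi2 += (o - e) ** 2 / e
    return chi2

# Rows and columns of the contingency table above.
table = [[250, 200],
         [50, 1000]]
```

The result matches the 507.93 shown above, strong evidence that the two variables are related.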
Scatter plots showing the similarity from -1 to 1.
Covariance:
Some pairs of random variables may have a covariance of 0 but are not independent.
Only under some additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence
Co-Variance: An Example
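The numbers of the original worked example did not survive extraction, so here is a hedged stand-in with invented stock-price data, using the standard shortcut Cov(A, B) = E[AB] - mean(A)·mean(B):

```python
# Population covariance of two paired attributes (invented data, for illustration).
def covariance(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum(x * y for x, y in zip(a, b)) / n - mean_a * mean_b

# Two stocks whose weekly prices tend to rise together (made-up values):
stock_a = [2, 3, 5, 4, 6]
stock_b = [5, 8, 10, 11, 14]
```

A positive covariance, as here, indicates the two stocks tend to move in the same direction.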
Data Transformation
Methods
min-max normalization
z-score normalization
Attribute/feature construction
Data Transformation: Normalization

min-max normalization:

v' = \frac{v - \min_A}{\max_A - \min_A}(\text{new\_max}_A - \text{new\_min}_A) + \text{new\_min}_A

z-score normalization:

v' = \frac{v - \text{mean}_A}{\text{stand\_dev}_A}

normalization by decimal scaling:

v' = \frac{v}{10^j}, where j is the smallest integer such that \max(|v'|) < 1
Example: for an income attribute with min = 12,000, max = 98,000, mean = 54,000, and standard deviation = 16,000, the value v = 73,600 normalizes as follows.

min-max (to [0.0, 1.0]): \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716

z-score: \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225
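The three normalization formulas can be sketched as small functions; the worked income numbers above serve as a check (function names are illustrative):

```python
# The three normalization methods as direct translations of the formulas.
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, stand_dev_a):
    return (v - mean_a) / stand_dev_a

def decimal_scaling(v, j):
    # j is the smallest integer such that max(|v'|) < 1 over the attribute.
    return v / 10 ** j

# Worked income example from the slide:
income_minmax = min_max(73_600, 12_000, 98_000)
income_zscore = z_score(73_600, 54_000, 16_000)
```

Each method maps values to a new, comparable range; which to use depends on whether the min/max, the mean and deviation, or only the magnitude is known.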
Data Preprocessing
Data cleaning
Data reduction
Summary
Data Reduction Strategies
Dimensionality Reduction
[Figure: decision-tree induction for attribute subset selection; a split on attribute A1 separates Class 1 from Class 2]
Data Compression
String compression
There are extensive theories and well-tuned
algorithms
Typically lossless
But only limited manipulation is possible without
expansion
Audio/video compression
Typically lossy compression, with progressive
refinement
Sometimes small fragments of signal can be
reconstructed without reconstructing the whole
Time sequence data is not audio: typically short, and it varies slowly with time
Data Compression
[Figure: original data reduced to compressed data; lossless compression recovers the original exactly, while lossy compression recovers only an approximation]
Wavelet Transformation
[Figure: Haar-2 and Daubechies-4 wavelet functions]
Method:
[Figure: hierarchical decomposition (DWT pyramid); the signal is repeatedly split by paired low-pass and high-pass filters]
Numerosity Reduction
Parametric methods
Log-linear models: obtain the value at a point in m-D space as the product of appropriate marginal subspaces
Non-parametric methods
Linear regression: Y = α + β X
The two parameters α and β specify the line and are to be estimated from the data at hand, by applying the least squares criterion to the known values of Y1, Y2, …, X1, X2, ….
Multiple regression: Y = b0 + b1 X1 + b2 X2.
Many nonlinear functions can be transformed
into the above.
Log-linear models:
The multi-way table of joint probabilities is
approximated by a product of lower-order
tables.
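The least squares estimates for the linear model can be sketched directly (function name is illustrative): β is the ratio of the cross-deviation sum to the squared-deviation sum, and α follows from the means.

```python
# Least-squares fit for the linear model Y = alpha + beta * X.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    beta = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))
    alpha = my - beta * mx
    return alpha, beta

# Points lying exactly on y = x + 1 recover alpha = 1, beta = 1:
alpha, beta = fit_line([1, 2, 3, 4], [2, 3, 4, 5])
```

Storing only the two fitted parameters instead of the data points is what makes regression a parametric reduction method.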
Histograms
A popular data reduction technique: divide data into buckets and store the average (or sum) for each bucket. Can be constructed optimally in one dimension using dynamic programming. Related to quantization problems.
[Figure: histogram of prices, with buckets spanning 10,000 to 100,000]
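A minimal sketch of the bucket idea, using equal-width buckets that keep only a count and a sum (the data and function name are illustrative assumptions):

```python
# Equi-width histogram reduction: each bucket stores only its count and sum.
def histogram_reduce(values, n_buckets):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    buckets = [{"count": 0, "sum": 0} for _ in range(n_buckets)]
    for v in values:
        # Clamp so the maximum value falls into the last bucket.
        i = min(int((v - lo) / width), n_buckets - 1)
        buckets[i]["count"] += 1
        buckets[i]["sum"] += v
    return buckets

prices = [10, 15, 18, 22, 25, 38]
buckets = histogram_reduce(prices, 3)
```

Six values collapse to three (count, sum) pairs, from which bucket averages can be recovered.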
Clustering
Sampling
Sampling
[Figure: from the raw data, SRSWOR (simple random sample without replacement) and SRSWR (simple random sample with replacement)]
Sampling
[Figure: raw data versus a cluster/stratified sample drawn from it]
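The sampling schemes above can be sketched with the stdlib random module; the strata names and sizes are invented for illustration:

```python
# SRSWOR, SRSWR, and stratified sampling with the stdlib random module.
import random

data = list(range(100))

# SRSWOR: simple random sample without replacement (no duplicates possible).
srswor = random.sample(data, 10)

# SRSWR: simple random sample with replacement (duplicates possible).
srswr = [random.choice(data) for _ in range(10)]

def stratified(strata, per_stratum):
    """strata: dict of stratum name -> items; sample each stratum separately."""
    return {name: random.sample(items, per_stratum)
            for name, items in strata.items()}

sample = stratified({"young": list(range(50)),
                     "senior": list(range(50, 100))}, 5)
```

Stratified sampling guarantees every stratum is represented, which plain SRS cannot, especially with skewed data.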
Hierarchical Reduction
Data Preprocessing
Data cleaning
Data reduction
Summary
Discretization
Discretization reduces the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values
Concept hierarchies - reduce the data by collecting and
replacing low level concepts (such as numeric values for the
attribute age) by higher level concepts (such as young,
middle-aged, or senior)
Discretization techniques can be categorized based on how the discretization is performed: whether it uses class information, and which direction it proceeds (i.e., top-down vs. bottom-up).
If the discretization process uses class information, then we
say it is supervised discretization. Otherwise, it is
unsupervised.
Entropy-Based Discretization
Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is

E(S, T) = \frac{|S_1|}{|S|} Ent(S_1) + \frac{|S_2|}{|S|} Ent(S_2)

The boundary T that minimizes the entropy function is selected; the information gain of the split is Ent(S) - E(T, S).
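A minimal sketch of the boundary search, assuming samples are (value, class label) pairs; all names and the tiny data set are illustrative:

```python
# Entropy-based split selection: try each candidate boundary T and keep the
# one that minimizes E(S, T) from the formula above.
from math import log2

def ent(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    counts = {lab: labels.count(lab) for lab in set(labels)}
    return -sum((c / n) * log2(c / n) for c in counts.values())

def best_split(points):
    """points: list of (value, class_label); returns (boundary, E(S, T))."""
    points = sorted(points)
    best = None
    for i in range(1, len(points)):
        t = (points[i - 1][0] + points[i][0]) / 2  # midpoint candidate boundary
        s1 = [c for v, c in points if v <= t]
        s2 = [c for v, c in points if v > t]
        e = (len(s1) / len(points)) * ent(s1) + (len(s2) / len(points)) * ent(s2)
        if best is None or e < best[1]:
            best = (t, e)
    return best

data = [(1, "a"), (2, "a"), (3, "a"), (10, "b"), (11, "b")]
```

On this toy data the boundary between 3 and 10 yields two pure intervals, so E(S, T) drops to 0.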
Segmentation by Natural Partitioning
Example (profit data):
Step 1: Min = -$351, Max = $4,700; 5th percentile Low = -$159, 95th percentile High = $1,838
Step 2: most significant digit msd = 1,000, so round to Low' = -$1,000, High' = $2,000
Step 3: partition (-$1,000 … $2,000] into three equi-width intervals: (-$1,000 … 0], (0 … $1,000], ($1,000 … $2,000]
Step 4: adjust to cover Min and Max, then refine each interval:
Since Min = -$351 is covered by (-$400 … 0], the first interval becomes (-$400 … 0]; since Max = $4,700 exceeds $2,000, the interval ($2,000 … $5,000] is added.
(-$400 … 0]: (-$400 … -$300], (-$300 … -$200], (-$200 … -$100], (-$100 … 0]
(0 … $1,000]: (0 … $200], ($200 … $400], ($400 … $600], ($600 … $800], ($800 … $1,000]
($1,000 … $2,000]: ($1,000 … $1,200], ($1,200 … $1,400], ($1,400 … $1,600], ($1,600 … $1,800], ($1,800 … $2,000]
($2,000 … $5,000]: ($2,000 … $3,000], ($3,000 … $4,000], ($4,000 … $5,000]
A concept hierarchy can be generated automatically by ordering attributes by their number of distinct values: the attribute with the fewest distinct values is placed at the highest level.
country (15 distinct values)
province_or_state (65 distinct values)
city (3,567 distinct values)
street
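The heuristic above is a one-line sort. The street count did not survive extraction, so a placeholder large value stands in below purely to keep street at the lowest level; it is an assumption, not the real count:

```python
# Order attributes into a concept hierarchy by distinct-value count:
# fewer distinct values -> higher level in the hierarchy.
distinct_counts = {
    "country": 15,
    "province_or_state": 65,
    "city": 3567,
    "street": 10**6,  # placeholder only: streets have by far the most values
}

def hierarchy_top_down(counts):
    """Return attribute names from the highest level (fewest values) down."""
    return sorted(counts, key=counts.get)
```

This recovers the intuitive geographic hierarchy country < province_or_state < city < street without any manual specification.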