Академический Документы
Профессиональный Документы
Культура Документы
Material do livro
Introduction to Data Mining
Tan, Steinbach, Kumar
Fabio R. Cerqueira
What is Data?
An attribute is a property or
characteristic of an object
Examples: eye color of a
person, temperature, etc.
Attribute is also known as
variable, field, characteristic,
or feature
Objects
A collection of attributes
describes an object
Object is also known as
record, point, case, sample,
entity, or instance
Attributes
Taxable
Income
Yes
Single
125K
No
Married
100K
No
Single
70K
Yes
Married
120K
No
Divorced 95K
No
Married
Yes
Divorced 220K
No
Single
85K
No
Married
75K
60K
Attribute Values
Attribute
Distinction
Types of Attributes
Ordinal
Interval
Ratio
Distinctness:
Order:
Addition:
Multiplication:
Record
Data Matrix
Document Data
Transaction Data
Graph
Molecular Structures
Ordered
Spatial Data
Temporal Data
Sequential Data
Record Data
Data
Taxable
Income Cheat
Yes
Single
125K
No
No
Married
100K
No
No
Single
70K
No
Yes
Married
120K
No
No
Divorced 95K
Yes
No
Married
No
Yes
Divorced 220K
No
No
Single
85K
Yes
No
Married
75K
No
10
No
Single
90K
Yes
60K
10
Data Matrix
Document Data
Each
Transaction Data
A special
Items
2
3
4
5
Beer, Bread
Beer, Coke, Diaper, Milk
Beer, Bread, Diaper, Milk
Coke, Diaper, Milk
10
Graph Data
Examples:
2
1
5
2
5
<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers
11
Chemical Data
Benzene
Molecule: C6H6
12
Ordered Data
Sequences
of transactions
Items/Events
An element of
the sequence
13
Ordered Data
14
Ordered Data
Spatio-Temporal
Data
Average Monthly
Temperature of
land and ocean
15
Data Quality
What
Noise
Noise
Outliers
Outliers
18
Missing Values
Reasons
missing values
Duplicate Data
Data
Examples:
cleaning
Data Preprocessing
Aggregation
Sampling
Dimensionality
Reduction
Feature subset selection
Feature creation
Discretization and Binarization
Attribute Transformation
21
Aggregation
Combining
Purpose
Data reduction
Change of scale
22
Aggregation
Variation of Precipitation in Australia
Sampling
24
Sampling
25
Types of Sampling
Stratified sampling
Split the data into several partitions; then draw random samples
from each partition
26
Sample Size
8000 points
2000 Points
500 Points
27
Curse of Dimensionality
When dimensionality
increases, data becomes
increasingly sparse in the
space that it occupies
Dimensionality Reduction
Purpose:
30
A x = x
32
Redundant
features
features
33
Brute-force approach:
Try
Embedded approaches:
Feature selection occurs naturally as part of the data mining
algorithm
Filter approaches:
Wrapper approaches:
Use the data mining algorithm as a black box to find best subset
of attributes
34
35
36
H(c)=-[7/10log2(7/10)+3/10log2(3/10)]=0.88.
H(c|Refund)=7/10 * -[4/7log2(4/7)+3/7log2(3/7)]
+ 3/10 * -[3/3log2(3/3)+0/3log2(0/3)]=0.69.
IG(Refund) = 0.88 0.69 = 0.19.
37
38
Feature Creation
39
based approach
Data
Equal frequency
K-means
41
Attribute Transformation
A function
42
44
Euclidean Distance
Euclidean Distance
dist =
( pk q k )
k =1
45
Euclidean Distance
3
point
p1
p2
p3
p4
p1
p3
p4
1
p2
0
0
y
2
0
1
1
p1
p1
p2
p3
p4
x
0
2
3
5
0
2.828
3.162
5.099
p2
2.828
0
1.414
3.162
p3
3.162
1.414
0
2
p4
5.099
3.162
2
0
Distance Matrix
46
Minkowski Distance
1
r r
dist =( pk q k )
k =1
47
r = 2. Euclidean distance
48
Minkowski Distance
point
p1
p2
p3
p4
x
0
2
3
5
y
2
0
1
1
L1
p1
p2
p3
p4
p1
0
4
4
6
p2
4
0
2
4
p3
4
2
0
2
p4
6
4
2
0
L2
p1
p2
p3
p4
p1
p2
2.828
0
1.414
3.162
p3
3.162
1.414
0
2
p4
5.099
3.162
2
0
L
p1
p2
p3
p4
p1
p2
p3
p4
0
2.828
3.162
5.099
0
2
3
5
2
0
1
3
3
1
0
2
5
3
2
0
Distance Matrix
49
Mahalanobis Distance
1
mahalanobis ( p , q )=( pq )
( pq )
1
j,k=
( X X j )( X ik X k )
n1 i=1 ij
Mahalanobis Distance
Covariance Matrix:
0.3 0.2
=
0.2 0.3
A: (0.5, 0.5)
B: (0, 1)
A
C: (1.5, 1.5)
Mahal(A,B) = 5
Mahal(A,C) = 4
51
2.
3.
2.
53
Cosine Similarity
56
Variation
attributes
Reduces to Jaccard for binary attributes
57
Correlation
58
Scatter plots
showing the
similarity from
1 to 1.
59
60
61
Density
Density-based
density
Examples:
Euclidean density
Probability density
62
63