Lecture Session 2
Delivered by
Dr. K. Satyanarayan Reddy
Data Preprocessing
Aggregation
Sampling
Dimensionality Reduction
Feature subset selection
Feature creation
Discretization and Binarization
Attribute Transformation
16 October 2015
Aggregation
Combining two or more attributes (or objects) into a single attribute (or object).
Purpose:
Data reduction: reduce the number of attributes or objects
Change of scale: cities aggregated into regions, states, countries, etc.
Aggregation
Figure: Variation of Precipitation in Australia, comparing the standard deviation of average monthly precipitation with the standard deviation of average yearly precipitation.
Data Mining Session - 2 By Dr. K. Satyanarayan Reddy
Sampling
Sampling is the main technique employed for data
selection.
It is often used for both the preliminary
investigation of the data and the final data analysis.
Statisticians sample because obtaining the entire set of
data of interest is too expensive or time consuming.
Sampling is used in data mining because processing the
entire set of data of interest is too expensive or time
consuming.
Sampling
The key principle for effective sampling is the
following:
using a sample will work almost as well as using the
entire data set, if the sample is representative
A sample is representative if it has approximately the
same property (of interest) as the original set of data
Types of Sampling
Simple Random Sampling
There is an equal probability of selecting any particular item
In sampling with replacement, the same object can be picked up more than
once
Stratified sampling
Split the data into several partitions; then draw random samples from each
partition
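The two schemes above can be sketched with NumPy (the data set, the four strata, and the sample sizes here are hypothetical choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(100)          # toy data set: items 0..99
labels = data % 4              # four hypothetical strata

# Simple random sampling without replacement: each item picked at most once.
srs = rng.choice(data, size=10, replace=False)

# Sampling with replacement: the same object can be picked more than once.
swr = rng.choice(data, size=10, replace=True)

# Stratified sampling: draw a random sample from each partition separately.
strat = np.concatenate([
    rng.choice(data[labels == s], size=3, replace=False)
    for s in np.unique(labels)
])

print(len(srs), len(set(srs.tolist())))  # 10 items, all distinct
print(len(strat))                        # 12 items, 3 per stratum
```

Stratified sampling guarantees every partition is represented even when some strata are small.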
Sample Size
Figure: the same data set sampled at 8000, 2000, and 500 points.
Sample Size
What sample size is necessary to get at least
one object from each of 10 groups?
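A small simulation can answer this question empirically, assuming the 10 groups are equally likely; the sample sizes and trial count below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(42)

def prob_all_groups(sample_size, n_groups=10, trials=10_000):
    """Estimate P(a sample contains at least one object from every group),
    assuming each draw lands in each group with equal probability."""
    hits = 0
    for _ in range(trials):
        groups = rng.integers(0, n_groups, size=sample_size)
        if len(np.unique(groups)) == n_groups:
            hits += 1
    return hits / trials

for n in (10, 20, 40, 60):
    print(n, prob_all_groups(n))
```

The probability climbs steeply with sample size: a sample of 10 almost never covers all 10 groups, while a sample of 60 almost always does.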
Curse of Dimensionality
When dimensionality increases, data becomes increasingly sparse in the space that it occupies.
Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful.
Dimensionality Reduction
Purpose:
Avoid curse of dimensionality
Reduce amount of time and memory required by data
mining algorithms
Allow data to be more easily visualized
May help to eliminate irrelevant features or reduce
noise
Techniques
Principal Component Analysis
Singular Value Decomposition
Others: supervised and non-linear techniques
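As a sketch of how PCA can be carried out, here is a minimal version using NumPy's SVD on centered data; the synthetic data set and the choice of keeping one component are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic correlated 2-D data, stretched along one direction.
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])

# PCA: center the data, then take the SVD; the rows of Vt are the
# principal components, ordered by decreasing variance explained.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

explained = S**2 / np.sum(S**2)   # fraction of variance per component
X_reduced = Xc @ Vt.T[:, :1]      # project onto the first component

print(explained)                  # first component dominates
print(X_reduced.shape)            # (200, 1)
```

Keeping only the components that explain most of the variance reduces dimensionality while losing little information.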
Feature Subset Selection
Irrelevant features contain no information that is useful for the data mining task at hand.
Example: students' ID is often irrelevant to the task of predicting students' GPA.
Embedded approaches:
Feature selection occurs naturally as part of the data mining algorithm
Filter approaches:
Features are selected before the data mining algorithm is run
Wrapper approaches:
Use the data mining algorithm as a black box to find the best subset of attributes
Feature Creation
Create new attributes that can capture the
important information in a data set much more
efficiently than the original attributes
Three general methodologies:
Feature extraction: domain-specific
Mapping data to a new space: e.g. a Fourier transform or wavelet transform, which exposes the frequency content of a time series
Feature construction: combining existing features
Discretization
Figure: discretization of a data set, comparing the original data with equal-frequency binning and K-means binning.
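Equal-frequency (equal-depth) binning can be sketched as follows; the data and the bin count are illustrative. The bin edges are taken at quantiles, so each bin receives roughly the same number of points:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=1000)   # toy continuous attribute

# Equal-frequency binning: edges at evenly spaced quantiles.
n_bins = 4
edges = np.quantile(data, np.linspace(0, 1, n_bins + 1))

# Assign each value to a bin; clip keeps the min/max values in range.
bins = np.clip(np.searchsorted(edges, data, side="right") - 1, 0, n_bins - 1)

counts = np.bincount(bins, minlength=n_bins)
print(counts)   # roughly 250 points per bin
```

Equal-width binning would instead space the edges evenly over the value range, which can leave some bins nearly empty on skewed data.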
Attribute Transformation
A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values.
Simple functions: x^k, log(x), e^x, |x|
Standardization and Normalization
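Standardization and min-max normalization, two common attribute transformations, can be sketched like this (the attribute values are made up for illustration):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Standardization (z-score): map x to mean 0 and standard deviation 1.
z = (x - x.mean()) / x.std()

# Min-max normalization: map x linearly onto [0, 1].
mm = (x - x.min()) / (x.max() - x.min())

# A simple log transformation (values must be positive).
lg = np.log(x)

print(z.mean(), z.std())    # approximately 0 and 1
print(mm.min(), mm.max())   # 0.0 and 1.0
```

Standardization is the usual choice before distance computations when attributes are on very different scales.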
Dissimilarity
A numerical measure of how different two data objects are: it is lower when the objects are more alike, with a minimum of 0 when they are identical. The distance measures that follow are dissimilarities.
Euclidean Distance

$$\mathrm{dist}(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$$

where $n$ is the number of dimensions (attributes) and $p_k$ and $q_k$ are the $k$-th attributes of data objects $p$ and $q$.
Euclidean Distance

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

       p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0

Distance Matrix
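The distance matrix above can be reproduced with a few lines of NumPy (a sketch; broadcasting computes all pairwise differences at once):

```python
import numpy as np

# The four points from the slide: p1..p4 as (x, y) rows.
P = np.array([[0, 2], [2, 0], [3, 1], [5, 1]], dtype=float)

# Pairwise Euclidean distances: dist(p, q) = sqrt(sum_k (p_k - q_k)^2).
diff = P[:, None, :] - P[None, :, :]    # shape (4, 4, 2)
D = np.sqrt((diff ** 2).sum(axis=-1))   # shape (4, 4), symmetric, zero diagonal

print(np.round(D, 3))
```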
Minkowski Distance
Minkowski Distance is a generalization of Euclidean Distance:

$$\mathrm{dist}(p, q) = \left( \sum_{k=1}^{n} \lvert p_k - q_k \rvert^{r} \right)^{1/r}$$

where $r$ is a parameter, $n$ is the number of dimensions (attributes), and $p_k$ and $q_k$ are the $k$-th attributes of data objects $p$ and $q$.
r = 1: city block (Manhattan, taxicab, L1 norm) distance.
A common example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors.
r = 2: Euclidean distance.
r → ∞: supremum (Lmax norm, L∞ norm) distance, the maximum difference between any attribute of the two vectors.
Do not confuse r with n, i.e., all these distances are defined for all numbers of dimensions.
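A minimal sketch of the Minkowski distance for these values of r, applied to the slide's points p1 and p2 (the helper function is my own):

```python
import numpy as np

def minkowski(p, q, r):
    """Minkowski distance; r = inf gives the supremum (L-max) distance."""
    d = np.abs(np.asarray(p, float) - np.asarray(q, float))
    if np.isinf(r):
        return float(d.max())
    return float((d ** r).sum() ** (1.0 / r))

p1, p2 = [0, 2], [2, 0]
print(minkowski(p1, p2, 1))             # L1 (city block): 4.0
print(minkowski(p1, p2, 2))             # L2 (Euclidean): ~2.828
print(minkowski(p1, p2, float("inf")))  # L-infinity (supremum): 2.0
```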
Minkowski Distance

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1     p1  p2  p3  p4
p1     0   4   4   6
p2     4   0   2   4
p3     4   2   0   2
p4     6   4   2   0

L2     p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0

L∞     p1  p2  p3  p4
p1     0   2   3   5
p2     2   0   1   3
p3     3   1   0   2
p4     5   3   2   0

Distance Matrix
Mahalanobis Distance

$$\mathrm{mahalanobis}(p, q) = (p - q)\, \Sigma^{-1}\, (p - q)^{T}$$

where $\Sigma$ is the covariance matrix of the input data $X$:

$$\Sigma_{j,k} = \frac{1}{n-1} \sum_{i=1}^{n} (X_{ij} - \overline{X}_j)(X_{ik} - \overline{X}_k)$$
Mahalanobis Distance
Covariance Matrix:

$$\Sigma = \begin{bmatrix} 0.3 & 0.2 \\ 0.2 & 0.3 \end{bmatrix}$$

A: (0.5, 0.5)
B: (0, 1)
C: (1.5, 1.5)

Mahal(A, B) = 5
Mahal(A, C) = 4
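The two values can be checked numerically. Note that the formula as written, (p - q) Sigma^{-1} (p - q)^T, yields the squared form of the Mahalanobis distance, which is what the values 5 and 4 correspond to (a sketch; the function name is my own):

```python
import numpy as np

Sigma = np.array([[0.3, 0.2], [0.2, 0.3]])   # covariance matrix from the slide
Sigma_inv = np.linalg.inv(Sigma)

def mahalanobis_sq(p, q):
    """Mahalanobis distance in the (squared) form used on the slide:
    (p - q) Sigma^{-1} (p - q)^T."""
    d = np.asarray(p, float) - np.asarray(q, float)
    return float(d @ Sigma_inv @ d)

A, B, C = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)
print(mahalanobis_sq(A, B))   # 5, as on the slide
print(mahalanobis_sq(A, C))   # 4, as on the slide
```

Because the covariance matrix stretches the space along the direction of correlation, C ends up closer to A than B is, even though B is closer in Euclidean terms.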
Cosine Similarity
If d1 and d2 are two document vectors, then
cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
where · indicates the vector dot product and ||d|| is the length of vector d.
Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
d1 · d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = 42^0.5 = 6.481
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = 6^0.5 = 2.449
cos(d1, d2) = 5 / (6.481 * 2.449) = 0.315
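The worked example can be verified in a few lines (a sketch; the helper name is my own):

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (||a|| ||b||)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine_similarity(d1, d2), 3))   # 0.315
```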
Correlation
Correlation measures the linear relationship between objects.
To compute correlation, we standardize the data objects, p and q, and then take their dot product.
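A minimal sketch of this standardize-then-dot-product computation; here the population standard deviation is used and the dot product is divided by n, one common convention, and the test vectors are made up:

```python
import numpy as np

def correlation(p, q):
    """Pearson correlation: standardize p and q, then take the (scaled)
    dot product of the standardized objects."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    ps = (p - p.mean()) / p.std()
    qs = (q - q.mean()) / q.std()
    return float(ps @ qs) / len(p)

x = [1, 2, 3, 4, 5]
print(round(correlation(x, [2, 4, 6, 8, 10]), 6))   # perfectly linear: 1.0
print(round(correlation(x, [5, 4, 3, 2, 1]), 6))    # reversed: -1.0
```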
Figure: scatter plots showing data sets with correlations ranging from -1 to 1.
39
16 October 2015
40
16 October 2015
41
Density
Density-based clustering requires a notion of density.
Examples:
Euclidean density: the number of points per unit volume
Probability density
Graph-based density
THANK YOU