Академический Документы
Профессиональный Документы
Культура Документы
Techniques
◦ Principle Component Analysis
◦ Singular Value Decomposition
◦ Others: supervised and non-linear techniques
Feature Subset Selection
Another way to reduce dimensionality of data
Redundant features
◦ duplicate much or all of the information contained in
one or more other attributes
◦ Example: purchase price of a product and the
amount of sales tax paid
Irrelevant features
◦ contain no information that is useful for the data
mining task at hand
◦ Example: students' ID is often irrelevant to the task
of predicting students' GPA
Feature Subset Selection
Techniques:
◦ Brute-force approach:
Try all possible feature subsets as input to data
mining algorithm
◦ Embedded approaches:
Feature selection occurs naturally as part of the
data mining algorithm. Algorithms for building
decision tree classifiers, often operate in this
manner
Feature Creation
Create new attributes that can capture the important
information in a data set much more efficiently than the
original attributes
Three general methodologies:
◦ Feature Extraction: The creation of new set of
features from the original raw data is known as
feature extraction.
◦ Mapping Data to New Space: if there are no.of
periodic patterns and a significant amount of noise is
present, then these patterns are hard to detect. Such
patterns are often detected by applying a Fourier
transform.
Discretization and Binarization
It is often necessary to transform a continuous
attribute into a categorical attribute called
Discretization, and both continuous and discrete
attributes may need to be transformed into one or
more binary attributes called Binarization.
Binarization
n
dist ( pk qk ) 2
k 1
Euclidean Distance
p1
point x y
2
p1 0 2
p3 p4
1
p2 2 0
p2 p3 3 1
0 p4 5 1
0 1 2 3 4 5 6
p1 p2 p3 p4
p1 0 2.828 3.162 5.099
p2 2.828 0 1.414 3.162
p3 3.162 1.414 0 2
p4 5.099 3.162 2 0
Distance Matrix
Minkowski Distance
Minkowski Distance is a generalization of Euclidean
Distance
L p1 p2 p3 p4
p1 0 2 3 5
p2 2 0 1 3
p3 3 1 0 2
p4 5 3 2 0
Distance Matrix
Mahalanobis Distance:
Euclidean distance was used to measure the distance between
people based in two attributes: age and income.
Mahalanobis distance, is useful when attributes are correlated, have
different ranges of values , and the distribution of the data is
approximately normal.
1 n
j,k ( X ij X j )( X ik X k )
n 1 i1
B A: (0.5, 0.5)
B: (0, 1)
A C: (1.5, 1.5)
Mahal(A,B) = 5
Mahal(A,C) = 4
Common Properties of a Distance
Distances, such as the Euclidean distance, have some
well known properties.
correlation( p, q) p q
Exploring Data:
Preliminary Investigation of the data
in order to better understand its
specific characteristics.
What is data exploration?
A preliminary exploration of the data to
better understand its characteristics.
Key motivations of data exploration include
◦ Helping to select the right tool for preprocessing or analysis
◦ Making use of humans’ abilities to recognize patterns
People can recognize patterns not captured by
data analysis tools
Measures of Location: Mean and
Median
The mean is the most common measure of the location
of a set of points.
However, the mean is very sensitive to outliers.
Thus, the median or a trimmed mean is also commonly
used.
Measures of Spread: Range and
Variance