
Dimensionality Reduction

Classification challenges
• In machine learning, we work with many parameters, which are called features.

• Classification becomes tedious when the number of features is too high.

• The higher the number of features, the harder it is to visualize the data.

• Redundant features may be present, which will bring down the performance of the model.
Dimensionality Reduction?
It is the process of reducing the number of random variables or features under consideration by obtaining a set of principal variables.
Example
• Spam email classification is a classic example.
• It involves a large number of features to classify whether a mail is spam or not.
• There could be redundant or ineffective features.
• Dimensionality reduction helps us identify the principal variables alone.
Dimensionality Reduction Visualization
• A 3-D classification problem can be hard to visualize.

• A 2-D one can be mapped to a simple two-dimensional space, and a 1-D problem to a simple line.
Methods of DR
The various methods used for dimensionality reduction include:

• Principal Component Analysis (PCA)
• Linear Discriminant Analysis (LDA)
• Generalized Discriminant Analysis (GDA)
Principal Component Analysis
PCA
• PCA is a way of identifying patterns in data, especially high-dimensional data.
• It highlights similarities and differences in high-dimensional data, where the luxury of graphical methods is not available.
• Once patterns are found, the data can be compressed without much loss of information.
PCA Applications
It is used in the following application areas:
• Data classification
• Data reduction
• Data compression
• Data smoothing
PCA Deep Dive
Simple Data Set

Transcription of 2 genes, Gene 1 and Gene 2, in 6 different mice.

Gene 1 vs. each mouse
Measure 2 Genes

2-dimensional graph:
Gene 1 – X axis and Gene 2 – Y axis
3-Dimensional Graph
4-Dimensional Graph
PCA is the solution
• Let's discuss how PCA can handle 4 or more genes.
• We will also see how a 4-D dataset is converted into a 2-D graph (see the code sketch below).
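As a concrete illustration, here is a minimal sketch of reducing a 4-D dataset to 2-D with scikit-learn's PCA. The 6×4 random matrix is a made-up stand-in for the 6-mice/4-genes data, not the values from the slides.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for the slide data: 6 mice (rows) x 4 genes (columns).
rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=2.0, size=(6, 4))

# Reduce the 4 gene dimensions to the 2 principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)                     # (6, 2): now plottable on a 2-D graph
print(pca.explained_variance_ratio_)  # share of variation captured by PC1, PC2
```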
PCA with 2 variables
PCA with 2 genes
PCA with 2 genes: find the center point
Focus the data

Let's focus on the plotted data and not on the original data anymore.
X marks the center point.
Move the data points to the center

Shifting the data did not change the data points' positions relative to each other.
Fit a Line & Rotate It
How does PCA decide whether this fit is good or not?
PC1 Calculation
Distance calculation
Distance Criteria

PCA can either minimize the distance b (from each point to the fitted line) or maximize the distance c (from each projected point to the origin). Since the distance a from a point to the origin is fixed, a² = b² + c², so the two criteria are equivalent.

SS (sum of squared) distances
Best fit line

The best-fit line is the one with the largest sum of squared distances (SS); a brute-force sketch of this criterion follows below.
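To make the criterion concrete, here is an assumed brute-force illustration (not the algorithm PCA actually uses internally): try many candidate directions through the origin and keep the one with the largest sum of squared projected distances. The points are made up.

```python
import numpy as np

# Made-up centered 2-D points, one row per sample.
pts = np.array([[2.0, 0.6], [-1.5, -0.3], [0.5, 0.2],
                [-1.0, -0.4], [1.2, 0.25], [-1.2, -0.45]])

best_ss, best_angle = -1.0, 0.0
for angle in np.linspace(0.0, np.pi, 1800):       # candidate line directions
    u = np.array([np.cos(angle), np.sin(angle)])  # unit vector along the line
    proj = pts @ u                                # distance c for each point
    ss = np.sum(proj ** 2)                        # sum of squared distances
    if ss > best_ss:
        best_ss, best_angle = ss, angle

print(np.tan(best_angle))  # slope of the best-fit line (the PC1 direction)
```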


PC1

The slope of the best-fit line is 0.25.
Inference

Most of the variation in the data is spread along Gene 1 and less along Gene 2.
PC1 = 4 parts of Gene 1 + 1 part of Gene 2
For PCA using SVD, the length of this vector is scaled to 1 (a unit vector).
PC1 – Final Calculation
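A short worked version of that final calculation, reconstructed here from the stated slope: a slope of 0.25 means 4 parts along Gene 1 for every 1 part along Gene 2, and scaling the vector (4, 1) to unit length gives the loading scores.

```latex
\mathbf{v} = (4,\ 1), \quad
\lVert \mathbf{v} \rVert = \sqrt{4^2 + 1^2} = \sqrt{17} \approx 4.12, \quad
\mathrm{PC1} = \frac{(4,\ 1)}{\sqrt{17}} \approx (0.97,\ 0.242)
```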
PC2 Calculation
PC2 Steps
PCA Final Plot
Rotate the plot
Variation around each PC
Scree Plot
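A minimal sketch of how such a scree plot could be drawn with matplotlib, reusing the made-up 6×4 data from the earlier scikit-learn snippet:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

# Refit PCA on the same made-up 6 x 4 data, keeping all components
# so that every PC appears in the plot.
rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=2.0, size=(6, 4))
pca = PCA().fit(X)

# Percentage of the total variation each principal component accounts for.
percent_var = pca.explained_variance_ratio_ * 100
labels = [f"PC{i + 1}" for i in range(len(percent_var))]

plt.bar(labels, percent_var)
plt.xlabel("Principal component")
plt.ylabel("Percentage of explained variation")
plt.title("Scree plot")
plt.show()
```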
PCA with 3 variables
PCA with 3 variables
Center the data & find the best-fit line to find PC1
PC2
PC3
The PCs and their variation
Convert the 3-D graph to a 2-D graph
Strip away everything other than PC1 and PC2
PCA with 4 variables
PCA with 4-D data
• It is not possible to draw a 4-D graph; however, the PCA calculation can still be done.
Scree plot for 4-D data

90% of the variation is captured by PC1 and PC2.
Principal Component Maths
Standard Deviation
• The Standard Deviation (SD) of a data set is a measure of how spread out the data are around the mean.
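The slide's formula did not survive extraction; the standard sample formula, consistent with the (n - 1) divisor used in the covariance example later, would be:

```latex
s = \sqrt{ \frac{ \sum_{i=1}^{n} (X_i - \bar{X})^2 }{ n - 1 } }
```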
Variance
• Variance is another measure of the spread of data in a data set. In fact, it is almost identical to the standard deviation: it is simply its square. The formula is shown below.
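Reconstructing the missing formula as the square of the sample standard deviation above:

```latex
s^2 = \frac{ \sum_{i=1}^{n} (X_i - \bar{X})^2 }{ n - 1 }
```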
Covariance
• Covariance is always measured between 2 dimensions.

• It is a measure of the relationship between two random variables.

• It is essentially a measure of the variance between two variables, as formalised below.

• Covariance is measured in units computed by multiplying the units of the two variables.

• Positive covariance: indicates that the two variables tend to move in the same direction.

• Negative covariance: reveals that the two variables tend to move in inverse directions.
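The sample covariance formula these bullets describe (reconstructed here; it matches the (n - 1) divisor used in the worked example below):

```latex
\mathrm{cov}(X, Y) = \frac{ \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) }{ n - 1 }
```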
Covariance Matrix
Matrix Algebra

Example for one non-eigen and one eigen


vector

4 is Eigen value and [3,2] is eigen vector
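The matrix from the original illustration did not survive extraction. As an assumed reconstruction, one 2×2 matrix with exactly this eigenpair is shown below; note that [1, 3] is not an eigenvector of it, which serves as the non-eigenvector half of the example.

```latex
\begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix}
\begin{pmatrix} 1 \\ 3 \end{pmatrix}
=
\begin{pmatrix} 11 \\ 5 \end{pmatrix}
\neq \lambda \begin{pmatrix} 1 \\ 3 \end{pmatrix},
\qquad
\begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix}
\begin{pmatrix} 3 \\ 2 \end{pmatrix}
=
\begin{pmatrix} 12 \\ 8 \end{pmatrix}
= 4 \begin{pmatrix} 3 \\ 2 \end{pmatrix}
```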


PCA Steps
Get the input data
RAW DATA
x: 2.5  0.5  2.2  1.9  3.1  2.3  2.0  1.0  1.5  1.1
y: 2.4  0.7  2.9  2.2  3.0  2.7  1.6  1.1  1.6  0.9
Center the data
CENTERED DATA
x: 0.69  -1.31  0.39  0.09  1.29  0.49  0.19  -0.81  -0.31  -0.71
y: 0.49  -1.21  0.99  0.29  1.09  0.79  -0.31  -0.81  -0.31  -1.01
Construct the covariance matrix

            x           y
x    0.616556    0.615444
y    0.615444    0.716556
Find the Eigenvalues and Eigenvectors

Eigenvectors (as columns, one per eigenvalue)
 0.735179   -0.677873
-0.677873   -0.735179

Eigenvalues
0.049083          0
       0   1.284028
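A minimal numpy sketch that reproduces these numbers from the raw data above. `np.cov` divides by n - 1 by default, matching the covariance matrix shown; `np.linalg.eigh` may return the eigenvectors with flipped signs, which is equally valid.

```python
import numpy as np

x = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])

# Step 2: center the data by subtracting each variable's mean.
x_c, y_c = x - x.mean(), y - y.mean()

# Step 3: covariance matrix (np.cov uses the n - 1 divisor by default).
cov = np.cov(x_c, y_c)
print(cov)        # [[0.616556 0.615444] [0.615444 0.716556]]

# Step 4: eigenvalues and eigenvectors (eigh is for symmetric matrices).
eig_vals, eig_vecs = np.linalg.eigh(cov)
print(eig_vals)   # [0.049083 1.284028]
print(eig_vecs)   # eigenvectors are the columns
```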
Rotated Axes

[Plot of the data on the rotated axes; both axes run from -2 to 2.]
Find the new feature vector
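Continuing the numpy sketch above: the feature vector is formed from the eigenvector(s) with the largest eigenvalue(s), and the centered data are projected onto it to give the final, reduced data.

```python
# Keep only the eigenvector with the largest eigenvalue (PC1).
feature_vector = eig_vecs[:, [np.argmax(eig_vals)]]  # shape (2, 1)

# Project the centered data onto PC1 to obtain the transformed data.
data_centered = np.vstack([x_c, y_c])                # shape (2, 10)
final_data = feature_vector.T @ data_centered        # shape (1, 10)
print(final_data.round(3))
```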
