
Dimensionality Reduction

Classification challenges
• In machine learning, we work with many parameters, which are called features.

• Classification becomes tedious when the number of features is too high.

• The higher the number of features, the harder it is to visualize the data.

• Redundant features may be present, which will bring down the performance of the model.
Dimensionality Reduction?
It is the process of reducing the number of random variables or features under consideration by obtaining a set of principal variables.
Example
• Spam email classification is a classic example.
• It involves a large number of features to classify whether a mail is spam or not.
• There could be redundant or ineffective features.
• Dimensionality reduction helps us identify the principal variables alone.
Dimensionality Reduction Visualization
• A 3-D classification problem can be hard to visualize.

• A 2-D one can be mapped to a simple two-dimensional space, and a 1-D problem to a simple line.
Methods of DR
The various methods used for dimensionality reduction include:

• Principal Component Analysis (PCA)
• Linear Discriminant Analysis (LDA)
• Generalized Discriminant Analysis (GDA)
Principal Component Analysis
PCA
• PCA is a way of identifying patterns in data, especially high-dimensional data.
• It highlights similarities and differences in high-dimensional data, where the luxury of graphical methods is not available.
• Once patterns are found, the data can be compressed without much loss of information.
PCA Applications
It is used in the following application areas:
• Data classification
• Data reduction
• Data compression
• Data smoothing
PCA Deep Dive
Simple Data Set

Transcription of 2 genes, Gene 1 and Gene 2, in 6 different mice.

Gene 1 vs. each mouse
Measure 2 Genes

2-dimensional graph:
Gene 1 – X axis and Gene 2 – Y axis
3-Dimensional Graph
4-Dimensional Graph
PCA is the solution
• Let's discuss how PCA can handle 4 or more genes.
• We will also see how a 4-D dataset is converted into a 2-D graph (see the code sketch below).
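As a concrete illustration, here is a minimal sketch of reducing a 4-D dataset to 2-D with scikit-learn's PCA. The 6×4 random matrix is a made-up stand-in for the 6-mice/4-genes data, not the values from the slides.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for the slide data: 6 mice (rows) x 4 genes (columns).
rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=2.0, size=(6, 4))

# Reduce the 4 gene dimensions to the 2 principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)                     # (6, 2): now plottable on a 2-D graph
print(pca.explained_variance_ratio_)  # share of variation captured by PC1, PC2
```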
PCA with 2 variables
PCA with 2 genes
PCA with 2 genes: find the center point
Focus the data

Let's focus on the plotted data and not on the original data anymore.
X marks the center point.
Move the data points to the center

Shifting the data did not change the data points' positions relative to each other.
Fit a Line & Rotate It
How does PCA decide whether this fit is good or not?
PC1 Calculation
Distance calculation
Distance Criteria

PCA can either minimize the distance b (from each point to the fitted line) or maximize the distance c (from each projected point to the origin). Since the distance a from a point to the origin is fixed, a² = b² + c², so the two criteria are equivalent.

SS (sum of squared) distances
Best fit line

The best-fit line is the one with the largest sum of squared distances (SS); a brute-force sketch of this criterion follows below.
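To make the criterion concrete, here is an assumed brute-force illustration (not the algorithm PCA actually uses internally): try many candidate directions through the origin and keep the one with the largest sum of squared projected distances. The points are made up.

```python
import numpy as np

# Made-up centered 2-D points, one row per sample.
pts = np.array([[2.0, 0.6], [-1.5, -0.3], [0.5, 0.2],
                [-1.0, -0.4], [1.2, 0.25], [-1.2, -0.45]])

best_ss, best_angle = -1.0, 0.0
for angle in np.linspace(0.0, np.pi, 1800):       # candidate line directions
    u = np.array([np.cos(angle), np.sin(angle)])  # unit vector along the line
    proj = pts @ u                                # distance c for each point
    ss = np.sum(proj ** 2)                        # sum of squared distances
    if ss > best_ss:
        best_ss, best_angle = ss, angle

print(np.tan(best_angle))  # slope of the best-fit line (the PC1 direction)
```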


PC1

The slope of the best-fit line is 0.25.
Inference

Most of the variation in the data is spread along Gene 1 and less along Gene 2.
PC1 = 4 parts of Gene 1 + 1 part of Gene 2
For PCA using SVD, the length of this vector is scaled to 1 (a unit vector).
PC1 – Final Calculation
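A short worked version of that final calculation, reconstructed here from the stated slope: a slope of 0.25 means 4 parts along Gene 1 for every 1 part along Gene 2, and scaling the vector (4, 1) to unit length gives the loading scores.

```latex
\mathbf{v} = (4,\ 1), \quad
\lVert \mathbf{v} \rVert = \sqrt{4^2 + 1^2} = \sqrt{17} \approx 4.12, \quad
\mathrm{PC1} = \frac{(4,\ 1)}{\sqrt{17}} \approx (0.97,\ 0.242)
```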
PC2 Calculation
PC2 Steps
PCA Final Plot
Rotate the plot
Variation around each PC
Scree Plot
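A minimal sketch of how such a scree plot could be drawn with matplotlib, reusing the made-up 6×4 data from the earlier scikit-learn snippet:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

# Refit PCA on the same made-up 6 x 4 data, keeping all components
# so that every PC appears in the plot.
rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=2.0, size=(6, 4))
pca = PCA().fit(X)

# Percentage of the total variation each principal component accounts for.
percent_var = pca.explained_variance_ratio_ * 100
labels = [f"PC{i + 1}" for i in range(len(percent_var))]

plt.bar(labels, percent_var)
plt.xlabel("Principal component")
plt.ylabel("Percentage of explained variation")
plt.title("Scree plot")
plt.show()
```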
PCA with 3 variables
PCA with 3 variables
Center the data & find the best-fit line to find PC1
PC2
PC3
The PCs and their variation
Convert the 3-D graph to a 2-D graph
Strip away everything other than PC1 and PC2
PCA with 4 variables
PCA with 4-D data
• It is not possible to draw a 4-D graph; however, the PCA calculation can still be done.
Scree plot for 4-D data

90% of the variation is captured by PC1 and PC2.
Principal Component Maths
Standard Deviation
• The Standard Deviation (SD) of a data set is a measure of how spread out the data are around the mean.
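The slide's formula did not survive extraction; the standard sample formula, consistent with the (n - 1) divisor used in the covariance example later, would be:

```latex
s = \sqrt{ \frac{ \sum_{i=1}^{n} (X_i - \bar{X})^2 }{ n - 1 } }
```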
Variance
• Variance is another measure of the spread of data in a data set. In fact, it is almost identical to the standard deviation: it is simply its square. The formula is shown below.
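Reconstructing the missing formula as the square of the sample standard deviation above:

```latex
s^2 = \frac{ \sum_{i=1}^{n} (X_i - \bar{X})^2 }{ n - 1 }
```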
Covariance
• Covariance is always measured between 2 dimensions.

• It is a measure of the relationship between two random variables.

• It is essentially a measure of the variance between two variables, as formalised below.

• Covariance is measured in units computed by multiplying the units of the two variables.

• Positive covariance: indicates that the two variables tend to move in the same direction.

• Negative covariance: reveals that the two variables tend to move in inverse directions.
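The sample covariance formula these bullets describe (reconstructed here; it matches the (n - 1) divisor used in the worked example below):

```latex
\mathrm{cov}(X, Y) = \frac{ \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) }{ n - 1 }
```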
Covariance Matrix
Matrix Algebra

Example for one non-eigen and one eigen


vector

4 is Eigen value and [3,2] is eigen vector
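The matrix from the original illustration did not survive extraction. As an assumed reconstruction, one 2×2 matrix with exactly this eigenpair is shown below; note that [1, 3] is not an eigenvector of it, which serves as the non-eigenvector half of the example.

```latex
\begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix}
\begin{pmatrix} 1 \\ 3 \end{pmatrix}
=
\begin{pmatrix} 11 \\ 5 \end{pmatrix}
\neq \lambda \begin{pmatrix} 1 \\ 3 \end{pmatrix},
\qquad
\begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix}
\begin{pmatrix} 3 \\ 2 \end{pmatrix}
=
\begin{pmatrix} 12 \\ 8 \end{pmatrix}
= 4 \begin{pmatrix} 3 \\ 2 \end{pmatrix}
```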


PCA Steps
Get the input data
RAW DATA
x: 2.5  0.5  2.2  1.9  3.1  2.3  2.0  1.0  1.5  1.1
y: 2.4  0.7  2.9  2.2  3.0  2.7  1.6  1.1  1.6  0.9
Center the data
CENTERED DATA
x: 0.69  -1.31  0.39  0.09  1.29  0.49  0.19  -0.81  -0.31  -0.71
y: 0.49  -1.21  0.99  0.29  1.09  0.79  -0.31  -0.81  -0.31  -1.01
Construct the covariance matrix

            x           y
x    0.616556    0.615444
y    0.615444    0.716556
Find the Eigenvalues and Eigenvectors

Eigenvectors (as columns, one per eigenvalue)
 0.735179   -0.677873
-0.677873   -0.735179

Eigenvalues
0.049083          0
       0   1.284028
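A minimal numpy sketch that reproduces these numbers from the raw data above. `np.cov` divides by n - 1 by default, matching the covariance matrix shown; `np.linalg.eigh` may return the eigenvectors with flipped signs, which is equally valid.

```python
import numpy as np

x = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])

# Step 2: center the data by subtracting each variable's mean.
x_c, y_c = x - x.mean(), y - y.mean()

# Step 3: covariance matrix (np.cov uses the n - 1 divisor by default).
cov = np.cov(x_c, y_c)
print(cov)        # [[0.616556 0.615444] [0.615444 0.716556]]

# Step 4: eigenvalues and eigenvectors (eigh is for symmetric matrices).
eig_vals, eig_vecs = np.linalg.eigh(cov)
print(eig_vals)   # [0.049083 1.284028]
print(eig_vecs)   # eigenvectors are the columns
```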
Rotated Axes

[Plot of the data on the rotated axes; both axes run from -2 to 2.]
Find the new feature vector
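Continuing the numpy sketch above: the feature vector is formed from the eigenvector(s) with the largest eigenvalue(s), and the centered data are projected onto it to give the final, reduced data.

```python
# Keep only the eigenvector with the largest eigenvalue (PC1).
feature_vector = eig_vecs[:, [np.argmax(eig_vals)]]  # shape (2, 1)

# Project the centered data onto PC1 to obtain the transformed data.
data_centered = np.vstack([x_c, y_c])                # shape (2, 10)
final_data = feature_vector.T @ data_centered        # shape (1, 10)
print(final_data.round(3))
```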
