
TOPIC: UNIT-4
1. VARIABLE REDUCTION
2. PRINCIPAL COMPONENT ANALYSIS

SUBMITTED BY
N.MAHESWARI (10UCS29)
C.MALARVIZHI (10UCS30)

VARIABLE REDUCTION
Principal component analysis is a variable-reduction procedure.

It is useful when we have obtained data on a large number of variables and believe that there is some redundancy in those variables.
In this case, redundancy means that some of the variables are correlated with one another, possibly because they are measuring the same construct. Because of this redundancy, we believe that it should be possible to reduce the observed variables to a smaller number of principal components (artificial variables) that will account for most of the variance in the observed variables.

What is Principal Component Analysis?


Principal component analysis (PCA) reduces the dimensionality of a data set by finding a new set of variables, smaller than the original set, that retains most of the sample's information. It is useful for the compression and classification of data.

By information we mean the variation present in the sample, given by the correlations between the original variables.
The new variables, called principal components (PCs), are uncorrelated, and are ordered by the amount of the total information each retains.

Why Dimensionality Reduction?


It is easy and convenient to collect data in an experiment, and data is not collected only for data mining: it accumulates at an unprecedented speed.

Data preprocessing is an important part of effective machine learning and data mining.
Dimensionality reduction is an effective approach to downsizing data.

Why PCA?
Most machine learning and data mining techniques may not be effective for high-dimensional data. This is the curse of dimensionality: classification accuracy and efficiency degrade rapidly as the dimension increases. The intrinsic dimension, however, may be small; for example, the number of genes responsible for a certain type of disease may be small.

Visualization: projection of high-dimensional data onto 2D or 3D.

Data compression: efficient storage and retrieval.

Feature extraction: extraction of useful features from the data.

PRINCIPAL COMPONENT ANALYSIS
Reduces the number of predictors by finding the weighted linear combinations of predictors that retain most of the variance in the data set. These combinations are called principal components. PCA works only with continuous variables.

DIMENSIONALITY REDUCTION

A prerequisite for dimensionality reduction is understanding the data, using e.g. data summaries (min, max, mean, median, stdev) and visualization.
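As a minimal sketch of such a summary (assuming pandas; the DataFrame and its column names are invented for illustration):

import pandas as pd

# Hypothetical client data; describe() reports count, mean, std, min,
# the quartiles (50% is the median) and max for every numeric column.
df = pd.DataFrame({
    "income": [32000, 45000, 39000, 58000],
    "age": [25, 41, 33, 52],
    "height_cm": [170, 165, 182, 176],
})
print(df.describe())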

Domain knowledge should always be applied first to remove predictors known to be inapplicable (e.g. height for predicting client income). The main techniques are then correlation analysis, principal component analysis, and binning.

CORRELATION ANALYSIS
With many variables there is usually overlap in the information they cover. A simple technique for finding redundancies is to look at the correlation coefficients in a correlation matrix: pairs that have a very strong positive or negative correlation contain a lot of overlap, and one variable of such a pair is a candidate for removal.
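A sketch of this redundancy check with NumPy (the synthetic data and the 0.9 cut-off are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=500)
X = np.column_stack([
    a,
    2 * a + 0.05 * rng.normal(size=500),   # near-duplicate of column 0
    rng.normal(size=500),                  # independent variable
])

corr = np.corrcoef(X, rowvar=False)        # correlation matrix of the columns
threshold = 0.9                            # illustrative cut-off for "very strong"
for i in range(corr.shape[0]):
    for j in range(i + 1, corr.shape[0]):
        if abs(corr[i, j]) > threshold:
            print(f"columns {i} and {j} overlap: r = {corr[i, j]:.2f}")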

PCA EXAMPLE
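A minimal sketch of PCA on synthetic data, using scikit-learn (the data set, noise level and component count are illustrative assumptions):

import numpy as np
from sklearn.decomposition import PCA

# Five observed variables generated from two underlying constructs,
# so most of the variance should be captured by two components.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 5))

pca = PCA(n_components=2)
Z = pca.fit_transform(X)

print(pca.explained_variance_ratio_)         # share of variance each PC retains
print(np.round(np.cov(Z, rowvar=False), 3))  # off-diagonals ~ 0: PCs uncorrelated

The two components retain nearly all of the variance here because the five observed variables are redundant by construction.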

Application of Dimensionality Reduction
PCA is required in several scientific fields, such as psychometrics, telecommunications, electroencephalography, stock markets and others. Applications include:
Customer relationship management
Text mining
Image retrieval
Microarray data analysis
Protein classification
Face recognition
Handwritten digit recognition
Intrusion detection

Major Techniques of Dimensionality Reduction

Feature selection: definition and objectives
Feature extraction (reduction): definition and objectives
Differences between the two techniques

Feature Selection
Definition: a process that chooses an optimal subset of features according to an objective function.
Objectives: to reduce dimensionality and remove noise, and to improve mining performance in terms of:
Speed of learning
Predictive accuracy
Simplicity and comprehensibility of mined results

Feature extraction: feature reduction refers to the mapping of the original high-dimensional data onto a lower-dimensional space. Given a set of n data points {x1, x2, ..., xn}, each described by p variables, the goal is to map them to a space of dimension d < p. The criterion for feature reduction can differ based on the problem setting:

Unsupervised setting: minimize the information loss


Supervised setting: maximize the class discrimination

Feature Reduction vs. Feature Selection


Feature reduction: all original features are used, and the transformed features are linear combinations of the original features.
Feature selection: only a subset of the original features is selected.
In this sense feature reduction is continuous (weighted combinations) while feature selection is discrete (each feature is either kept or dropped); see the sketch below.
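The contrast in a few lines of NumPy (the weight matrix W below is arbitrary; PCA would instead choose its columns as the top eigenvectors of the covariance matrix):

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))        # 100 samples, 4 original features

# Feature selection: keep a subset of the original columns unchanged.
X_selected = X[:, [0, 2]]            # e.g. features 0 and 2 survive

# Feature reduction: every new feature is a linear combination of ALL
# original features.
W = rng.normal(size=(4, 2))          # arbitrary 4x2 weight matrix
X_reduced = X @ W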

Features:
It is computationally inexpensive.
It can be applied to ordered and unordered attributes.
It can handle sparse data and skewed data.
Multidimensional data of more than two dimensions can be handled.

ALGORITHM:
The PCA algorithm consists of 5 main steps:

1. Subtract the mean: subtract the mean from each of the data dimensions. The mean subtracted is the average across each dimension. This produces a data set whose mean is zero.
2. Calculate the covariance matrix: the covariance matrix C is a p x p matrix in which each entry C(i, j) is the covariance between dimensions i and j, computed over the n data points as

cov(X, Y) = sum_{k=1..n} (X_k - mean(X)) (Y_k - mean(Y)) / (n - 1)

3. Calculate the eigenvectors and eigenvalues of the covariance matrix.

4. Choose components and form a feature vector: once eigenvectors are found from the covariance matrix, order them by eigenvalue, highest to lowest, so that the components are sorted in order of significance. The number of eigenvectors that you choose will be the number of dimensions of the new data set. The objective of this step is to construct a feature vector (a matrix of vectors): from the list of eigenvectors, take the selected ones and form a matrix with them in the columns: FeatureVector = (eig_1, eig_2, ..., eig_n)

5. Derive the new data set: take the transpose of the feature vector and multiply it on the left of the mean-adjusted data set, transposed: FinalData = RowFeatureVector x RowDataAdjusted, where RowFeatureVector is the matrix with the eigenvectors in the columns, transposed (the eigenvectors are now in the rows, with the most significant at the top), and RowDataAdjusted is the mean-adjusted data transposed (the data items are in each column, and each row holds a separate dimension).
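These five steps translate almost line for line into NumPy. The following is a sketch on synthetic data (the shapes and the choice of k = 2 kept components are illustrative assumptions; variable names mirror RowFeatureVector and RowDataAdjusted above):

import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 3))  # 50 samples, 3 dims

# Step 1: subtract the mean of each dimension (columns now average to zero).
adjusted = data - data.mean(axis=0)

# Step 2: covariance matrix; entry (i, j) is cov(dimension i, dimension j).
cov = np.cov(adjusted, rowvar=False)

# Step 3: eigenvectors and eigenvalues of the (symmetric) covariance matrix.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Step 4: order by eigenvalue, highest first, and keep the top k eigenvectors
# as the columns of the feature vector.
order = np.argsort(eigenvalues)[::-1]
k = 2
feature_vector = eigenvectors[:, order[:k]]   # FeatureVector = (eig_1, ..., eig_k)

# Step 5: FinalData = RowFeatureVector x RowDataAdjusted.
row_feature_vector = feature_vector.T         # eigenvectors now in the rows
row_data_adjusted = adjusted.T                # one dimension per row
final_data = row_feature_vector @ row_data_adjusted   # shape (k, 50)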
