67 views

Uploaded by bioinfokarthi

- Principal Component Analysis Primer
- PCA
- The Social Semantics of LiveJournal FOAF: Structure and Change from 2004 to 2005
- NMF Tutorial
- Detection of Human Actions From a Single Example - SEO, Milanfar - Proceedings of International Conference on Computer Vision - 2009
- Bonnet 2009 JI Eb169 SupFig
- ms2026_final.pdf
- Probability G.M.F
- Automated Bone Landmarks Prediction
- EtologiaCaos
- 2015 a Guide to Statistical Analysis in Microbial Ecology- A Community-focused, Living Review of Multivariate Data Analyses
- Sidhu Mission Statements
- 10.1.1.91
- b 34740971
- Face Recognition
- 1-s2.0-S0950329310000777-main
- qtb assignment
- g 04401051055
- sensors-16-01409
- paper2

You are on page 1of 2

© 2008 Nature Publishing Group http://www.nature.com/naturebiotechnology

Markus Ringnér

Principal component analysis is often incorporated into genome-wide expression studies, but what is it and how can it be used to

explore high-dimensional data?

the life sciences gather data for many more

variables per sample than the typical number

geometrical interpretations of the data. To

allow for such interpretations, imagine that

the microarrays in our example measured the

Dimensional reduction and visualization

We can reduce the dimensionality of our two-

dimensional expression profiles to a single

of samples assayed. For instance, DNA micro- expression levels of only two genes, GATA3 dimension by projecting each sample onto

arrays and mass spectrometers can measure and XBP1. This simplifies plotting the breast the first principal component (Fig. 1c). This

levels of thousands of mRNAs or proteins in cancer samples according to their expression one-dimensional representation of the data

hundreds of samples. Such high-dimensional- profiles, which in this case consist of two num- retains the separation of the samples accord-

ity makes visualization of samples difficult and bers (Fig. 1a). Breast cancer samples are clas- ing to estrogen receptor status. The projection

limits simple exploration of the data. sified as being either positive or negative for of the data onto a principal component can

Principal component analysis (PCA) is a the estrogen receptor, and I have selected two be viewed as a gene-like pattern of expression

mathematical algorithm that reduces the dimen- genes whose expression is known to correlate across samples, and the normalized pattern is

sionality of the data while retaining most of the with estrogen receptor status3. sometimes called an eigengene. So for each

variation in the data set1. It accomplishes this PCA identifies new variables, the principal sample-like component, PCA reveals a cor-

reduction by identifying directions, called prin- components, which are linear combinations responding gene-like pattern containing the

cipal components, along which the variation in of the original variables. The two principal same variation in the data as the component.

the data is maximal. By using a few components, components for our two-dimensional gene Moreover, provided that data are standardized

each sample can be represented by relatively few expression profiles are shown in Figure 1b. It so that samples have zero average expression,

numbers instead of by values for thousands of is easy to see that the first principal component the eigengenes are eigenvectors to the covari-

variables. Samples can then be plotted, making it is the direction along which the samples show ance matrix of the samples.

possible to visually assess similarities and differ- the largest variation. The second principal So far we have used data for only two genes

ences between samples and determine whether component is the direction uncorrelated to the to illustrate how PCA works, but what happens

samples can be grouped. first component along which the samples show when thousands of genes are included in the

Saal et al.2 used microarrays to measure the the largest variation. If data are standardized analysis? Let’s apply PCA to the 8,534 probes

expression of 27,648 genes in 105 breast tumor such that each gene is centered to zero aver- on the microarrays with expression measure-

samples. I will use this gene expression data set, age expression level, the principal components ments for all 105 samples. To get a view of the

which is available through the Gene Expression are normalized eigenvectors of the covariance dimensionality of the data, we begin by look-

Omnibus database (accession no. GSE5325), matrix of the genes and ordered according to ing at the proportion of the variance present

to illustrate how PCA can be used to represent how much of the variation present in the data in all genes contained within each principal

samples with a smaller number of variables, they contain. Each component can then be component (Fig. 1d). Note that although the

visualize samples and genes, and detect domi- interpreted as the direction, uncorrelated to first few components have more variance than

nant patterns of gene expression. My aim with previous components, which maximizes the later components, the first two components

this example is to leave you with an idea of how variance of the samples when projected onto retain only 22% of the original variance and

PCA can be used to explore data sets in which the component. Here, genes were centered in 63 components are needed to retain 90% of

thousands of variables have been measured. all examples before PCA was applied to the the original variance. On the other hand, 104

data. The first component in Figure 1b can components are enough to retain all the origi-

Principal components be expressed in terms of the original variables nal variance—a much smaller number than

Although understanding the details underly- as PC1 = 0.83 × GATA3 + 0.56 × XBP1. The the original 8,534 variables. When the number

ing PCA requires knowledge of linear alge- components have a sample-like pattern with of variables is larger than the number of sam-

bra1, the basics can be explained with simple a weight for each gene and are sometimes ples, PCA can reduce the dimensionality of the

referred to as eigenarrays. Methods related to samples to, at most, the number of samples,

Markus Ringnér is in the Division of Oncology, PCA include independent component analysis, without loss of information.

Department of Clinical Sciences, Barngatan 2, which is designed to identify components that To see whether the variation retained in the

Lund University, 221 85, Lund, Sweden. are statistically independent from each other, first two components contains relevant infor-

e-mail: markus.ringner@med.lu.se rather than being uncorrelated4. mation about the breast cancer samples, each

PRIMER

2

PC izing a data set. As the principal components

2 2

are uncorrelated, they may represent different

PC

All

1

0 0 aspects of the samples. This suggests that PCA

XBP1

XBP1

ER–

−2 −2

can serve as a useful first step before clustering

ER+ or classification of samples. However, decid-

−4 −4 ing how many and which components to use

−4 −2 0 2 −4 −2 0 2 −4 −2 0 2 in the subsequent analysis is a major chal-

GATA3 GATA3 Projection onto PC1 lenge that can be addressed in several ways1.

d

© 2008 Nature Publishing Group http://www.nature.com/naturebiotechnology

14

e f For example, one can use components that

P1

40 40

correlate with a phenotype of interest9 or use

Proportion of variance (%)

XB

12

Projection onto PC2

10

8 0 CCNB2 0 variation in the data11. PCA results depend

6 −20 −20 critically on preprocessing of the data and

4

−40 −40

on selection of variables. Thus, inspecting

2 PCA plots can potentially provide insights

0 −60 −60

0 20 40 60 80 100 −40 −20 0 20 40 60 −40 −20 0 20 40 60

into different choices of preprocessing and

Principal component Projection onto PC1 Projection onto PC1 variable selection.

PCA is often implemented using the sin-

Figure 1 Principal component analysis (PCA) of a gene expression data set. (a) Each dot represents gular value decomposition (SVD) of the

a breast cancer sample plotted against its expression levels for two genes. (In a–c, e, samples are

data matrix1. The sample-like eigenarray

colored according to estrogen receptor (ER) status: ER+, red; ER–, black). (b) PCA identifies the two

directions (PC1 and PC2) along which the data have the largest spread. (c) Samples plotted in one and the gene-like eigengene patterns are

dimension using their projections onto the first principal component (PC1) for ER +, ER– and all samples both uncovered simultaneously by SVD10,12.

separately. (d) The variance of the principal components when PCA is applied to all 8,534 genes with Many applications beyond dimensional

expression levels for all samples. (e) PCA biplot with samples plotted in two dimensions using their reduction, classification and clustering have

projections onto the first two principal components, and two genes plotted using their weights for the taken advantage of global representations of

components (green points). The scale shown is for the samples; for the genes, the scale should be

expression profiles generated by this decom-

divided by 950. (f) Samples plotted as in e but colored according to ERBB2 status (blue, ERBB2 + ;

brown, ERBB2 – ; green, unknown).

position. Applications include identifying

patterns that correlate with experimental

artifacts and filtering them out6, estimating

sample is projected onto these components in As the principal components have a sam- missing data, associating genes and expres-

Figure 1e. The result is that the dimensional- ple-like pattern with a weight for each gene, sion patterns with activities of regulators

ity can be reduced from the number of genes we can use the weights to visualize each gene and helping to uncover the dynamic archi-

down to two dimensions, while still retaining in the PCA plot8. Most genes will be close tecture of cellular phenotypes7,10,12. The

information that separates estrogen recep- to the origin in such a biplot of genes and rapid growth in technologies that generate

tor–positive from estrogen receptor–negative samples, whereas the genes having the larg- high-dimensional molecular biology data

samples. Estrogen receptor status is known est weights for the displayed components will likely provide many new applications

to have a large influence on the gene expres- will extend out in their respective direc- for PCA in the years to come.

sion profiles of breast cancer cells3. However, tions9. Biplots provide one way to use the

ACKNOWLEDGMENTS

note that PCA did not generate two separate correspondence between the gene-like and

I wish to thank the Swedish Foundation for Strategic

clusters (Fig. 1e), indicating that discover- sample-like patterns revealed by PCA to Research for support through the Lund Strategic Centre

ing unknown groups using PCA is difficult. identify groups of genes having expression for Clinical Cancer Research (CREATE Health).

Moreover, gene expression profiles can also levels characteristic for a group of samples.

be used to classify breast cancer tumors As an example, two genes with large weights 1. Jolliffe, I.T. Principal Component Analysis (Springer,

New York, 2002).

according to whether they have gained DNA are displayed in Figure 1e.

2. Saal, L.H. et al. Proc. Natl. Acad. Sci. USA 104, 7564–

copies of ERBB2 or not3 and this informa- 7569 (2007).

tion is lost when reducing this data set to the Applications in computational biology 3. Perou, C.M. et al. Nature 406, 747–752 (2000).

4. Comon, P. Signal Process. 36, 287–314 (1994).

first two principal components (Fig. 1f). This An obvious application of PCA is to explore 5. Coombes, K.R. et al. Nat. Biotechnol. 23, 291–292

reminds us that PCA is designed to identify high-dimensional data sets, as outlined above. (2005).

directions with the largest variation and not Most often, three-dimensional visualizations 6. Nielsen, T.O. et al. Lancet 359, 1301–1307 (2002).

7. Li, C.M. & Klevecz, R.R. Proc. Natl. Acad. Sci. USA 103,

directions relevant for separating classes are used for such explorations, and samples 16254–16259 (2006).

of samples. Also, it is important to bear in are either projected onto the components, as 8. Gabriel, K.R. Biometrika 58, 453–467 (1971).

9. Landgrebe, J. Wurst, W. & Welzl, G. Genome Biol. 3,

mind that much of the variation in data from in the examples here, or plotted according

RESEARCH0019 (2002).

high-throughput technologies may be due to to their correlation with the components10. 10. Alter, O., Brown, P.O. & Botstein, D. Proc. Natl. Acad.

systematic experimental artifacts5–7, result- As much information will typically be lost Sci. USA 97, 10101–10106 (2000).

11. Khan, J. et al. Nat. Med. 7, 673–679 (2001).

ing in dominant principal components that in two- or three-dimensional visualizations, 12. Holter, N.S. et al. Proc. Natl. Acad. Sci. USA 97, 8409–

correlate with artifacts. it is important to systematically try different 8414 (2000).

- Principal Component Analysis PrimerUploaded byAri David Paul
- PCAUploaded byPritam Bardhan
- The Social Semantics of LiveJournal FOAF: Structure and Change from 2004 to 2005Uploaded bySam Chase
- NMF TutorialUploaded byMuhammad Rizwan
- Detection of Human Actions From a Single Example - SEO, Milanfar - Proceedings of International Conference on Computer Vision - 2009Uploaded byzukun
- Bonnet 2009 JI Eb169 SupFigUploaded bymariebonnet78
- ms2026_final.pdfUploaded byelyesyoussef
- Probability G.M.FUploaded byClaudio Amaury C. Jiménez
- Automated Bone Landmarks PredictionUploaded bymemguoyan
- EtologiaCaosUploaded byBen Phear
- 2015 a Guide to Statistical Analysis in Microbial Ecology- A Community-focused, Living Review of Multivariate Data AnalysesUploaded byMarcelina Mendoza Salazar
- Sidhu Mission StatementsUploaded byCRNIPAS
- 10.1.1.91Uploaded byNuttawut Charoenchai
- b 34740971Uploaded byJe Rivas
- Face RecognitionUploaded byNelamalli Sandy
- 1-s2.0-S0950329310000777-mainUploaded byBernat Og
- qtb assignmentUploaded byapi-281610624
- g 04401051055Uploaded byInternational Journal of computational Engineering research (IJCER)
- sensors-16-01409Uploaded byHuong Ninh
- paper2Uploaded bysid
- 03 Pre ProcessingUploaded byJyothi Jyo
- Pro Genesis LC MS v2.6 User GuideUploaded byMohit Singh
- jamshidian.pdfUploaded byseth_tolev
- ie302069qUploaded byNúbia Batista
- 10.1.1.19Uploaded byandhra2003
- fnhum-08-00051Uploaded byabijosh
- Pca 3Uploaded bysitaram_1
- Full_Pattern_Cluster_Poster.pdfUploaded bySaputra
- Random Variables Tarea TeoríaUploaded byMarco Espinosa
- en_Tanagra_hac_pca.pdfUploaded byCherry Ratisoontorn

- Soya Sauce Production DetailsUploaded byNirmal Sharma
- Unit 3 Internal Combustion EnginesUploaded byMatthew Smith
- Analisis Soalan Percubaan Sains PMR 2011Uploaded byErmie Dharlya
- AAPG Unpad SC - Course and WorkshopUploaded byludy putra
- IRIS BCR200DTP Driver Installation GuideUploaded byHairi Ng
- Enumeration of Bacteria in Milk Samples and Presumptive Test for ColiformsUploaded byAlda Yana
- Pulsed Neutron Logging in Tight Gas Sand ReservoirUploaded byhunterextreme
- Prem Chowdhry CVUploaded byPrem Chowdhry
- Gate ServiceUploaded byCowdrey Ilan
- Airlines NDT Conf Digital-RadiographyUploaded byRamakrishnan AmbiSubbiah
- Power Consumption in Stirred Tanks Provided with Multiple Pitched-Blade TurbinesUploaded byshanto4
- CE121F_FieldWork1Uploaded byIdate Patrick
- 5949_4f_ELGEF_Plus_lowendUploaded byKmrul Hisyam
- jemh104Uploaded byahadaziz
- 54997729 WinCC Flexible Time Switch EnUploaded bySaid Boubker
- Transport Phenomena Data Companio.pdfUploaded byLavanya Chandrashekar
- Sample Wonderlic Test Questions AnswersUploaded bykris_acc85
- Satellite TransmissionUploaded bySam Ficher
- 12_2devphasedarrayUploaded byReza Gholami
- BiochemistryUploaded bynikkikeswani
- White Paper - Net Optics Load Balancing Solutions: A Brief Guide to Understanding Your OptionsUploaded byMichael Tunk
- Heron TrianglesUploaded byhumejias
- 01-NORA_Steam Power Plant.pdfUploaded bymasriza
- 16 Advanced Cutting Tool MaterialsUploaded byPRASAD326
- Mechatronics LabUploaded bykbmn2
- Design LabUploaded byVictor Evangelista
- LSA 52.3 LeroyUploaded byWilmer
- Dbms Ob QuestionsUploaded byLalit Thomria
- elit software hvacUploaded byLorenz Sanz
- lect01Uploaded byashmhd