
Working with Affymetrix data: estrogen, a 2x2 factorial design example

0.) How to install the estrogen data library: download estrogen by selecting Packages -> Select repositories -> BioC experiment, and then

Packages -> Install packages -> estrogen
Packages -> Install packages -> affydata

Data
"The investigators in this experiment were interested in the effect of estrogen on the genes in ER+ breast cancer cells over time. After serum starvation of all eight samples, they exposed four samples to estrogen, and then measured mRNA transcript abundance after 10 hours for two samples and 48 hours for the other two. They left the remaining four samples untreated, and measured mRNA transcript abundance at 10 hours for two samples, and 48 hours for the other two. Since there are two factors in this experiment (estrogen and time), each at two levels (present or absent, 10 hours or 48 hours), this experiment is said to have a 2x2 factorial design."

The table below describes the experimental conditions for each of the eight arrays.

Name      FileName      Target
Abs10.1   low10-1.cel   EstAbsent10
Abs10.2   low10-2.cel   EstAbsent10
Pres10.1  high10-1.cel  EstPresent10
Pres10.2  high10-2.cel  EstPresent10
Abs48.1   low48-1.cel   EstAbsent48
Abs48.2   low48-2.cel   EstAbsent48
Pres48.1  high48-1.cel  EstPresent48
Pres48.2  high48-2.cel  EstPresent48

The file (shown as a table above) is known as the "RNA Targets" file in affylmGUI. It should be stored in tab-delimited text format, and it can be created either in a spreadsheet program (such as Excel) or in a text editor. The column headings must appear exactly as above. Each chip should be given a unique name (in the Name column), which is used to create default plotting labels. Here, short names are used to ensure that the labels fit in the space available on the plots. The Affymetrix CEL file name should be listed for each chip in the FileName column. The Target column tells affylmGUI which chips are replicates. By using the same "Target" name for the first two rows in the table, we are telling affylmGUI that these two CEL files represent replicate chips for the experimental condition (Estrogen Absent, Time 10 hours). Note that if we only wanted to compare Estrogen Absent and Estrogen Present, we could use a different Targets file in which the time effect is ignored. This would simply require removing the 10's and 48's from the Target column.
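The Targets file described above can also be produced from R. The following is a minimal sketch in base R, not part of affylmGUI: the file name targets.txt and the use of a temporary directory are assumptions made for illustration.

```r
# Build the RNA Targets table from the worked example as a data frame.
targets <- data.frame(
  Name     = c("Abs10.1", "Abs10.2", "Pres10.1", "Pres10.2",
               "Abs48.1", "Abs48.2", "Pres48.1", "Pres48.2"),
  FileName = c("low10-1.cel", "low10-2.cel", "high10-1.cel", "high10-2.cel",
               "low48-1.cel", "low48-2.cel", "high48-1.cel", "high48-2.cel"),
  Target   = c("EstAbsent10", "EstAbsent10", "EstPresent10", "EstPresent10",
               "EstAbsent48", "EstAbsent48", "EstPresent48", "EstPresent48"),
  stringsAsFactors = FALSE
)

# Hypothetical output location; use your own project directory in practice.
targets_file <- file.path(tempdir(), "targets.txt")

# Tab-delimited, with the column headings exactly as in the table.
write.table(targets, targets_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Read the file back and count the chips per Target (replicates share a Target).
targets_in <- read.delim(targets_file, stringsAsFactors = FALSE)
table(targets_in$Target)
```

Each Target should appear twice, once for each replicate chip. Dropping the 10 and 48 suffixes from the Target column would give the time-ignoring variant mentioned above.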

The data for this worked example is available at http://bioinf.wehi.edu.au/affylmGUI/DataSets.html.

1.) Preliminaries.
To go through this exercise, you need to have installed R >= 2.13.0 and the libraries Biobase, affy, hgu95av2, hgu95av2cdf, and vsn from Bioconductor.

library(affy)
library(estrogen)
library(vsn)
library(affydata)
library(affyQCReport)

2.) Load the data.


a. Find the directory where the example CEL files are. The directory path should end in .../R/library/estrogen/extdata.

datadir = system.file("extdata", package = "estrogen")
datadir          #prints where the estrogen data folder is located
dir(datadir)     #lists all the files in the data folder
setwd(datadir)   #sets the working directory to the data directory
getwd()          #prints the current working directory

The function system.file is used here to find the subdirectory extdata of the estrogen package on your computer's hard disk. To use your own data, set datadir to the appropriate path instead.

b. The file estrogen.txt contains information on the samples that were hybridized onto the arrays. Look at it in a text editor. Again, to use your own data, you need to prepare a similar file with the appropriate information on your arrays and samples. To load it into a phenoData object:

pd = read.AnnotatedDataFrame("estrogen.txt", header = TRUE, row.names = 1)
#Not working? If this creates an empty object, build the object by hand:
est.pdata <- read.table("estrogen.txt", header = TRUE, row.names = 1)
est.label = data.frame(labelDescription = c("estrogen status", "time (hours)"))
est.label
getSlots("AnnotatedDataFrame")
pd = new("AnnotatedDataFrame", data = est.pdata, varMetadata = est.label)
pData(pd)
pd@data

> pData(pd)
              estrogen  time.h
low10-1.cel   absent    10
low10-2.cel   absent    10
high10-1.cel  present   10
high10-2.cel  present   10
low48-1.cel   absent    48
low48-2.cel   absent    48
high48-1.cel  present   48
high48-2.cel  present   48

phenoData objects are where Bioconductor stores information about samples, for example, treatment conditions in a cell line experiment or clinical or histopathological characteristics of tissue biopsies. The header option lets the read.AnnotatedDataFrame function know that the first line in the file contains column headings, and the row.names option indicates that the first column of the file contains the row names.

c. Load the data from the CEL files as well as the phenotypic data into an AffyBatch object.

a = ReadAffy(filenames = rownames(pData(pd)), phenoData = pd, verbose = TRUE)
a
class(a)
getSlots("AffyBatch")

The function ReadAffy is quite flexible. It lets you specify the filenames, phenotype, and MIAME (Minimum Information About a Microarray Experiment) information. You can enter them by reading files (see the help file) or with widgets (you need to have the tkWidgets package installed and working).

library(tkWidgets)
Data <- ReadAffy(widget = TRUE)   #read data in working directory

This function call will pop up a file browser widget that provides an easy way of choosing CEL files.

Data <- ReadAffy()   #will read all the CEL files in your working directory
#How do you get rid of one file from the Data?

If you don't have the packages in your R installation, you can install them using the following commands.

source("http://www.bioconductor.org/getBioC.R")
getBioC(pkg1, pkg2)   #pkg1 and pkg2 are the names of the packages you want to install

?read.affybatch
?read.AnnotatedDataFrame
?read.MIAME

Quality Assessment (Chapter 3 of our textbook)

library(affyQCReport)
QCReport(a, file = "ExampleQC.pdf")

Terminology

probe: oligonucleotides of 25 base pair length used to probe RNA targets.
perfect match: probes intended to match perfectly the target sequence.
PM: intensity value read from the perfect matches.
mismatch: the probes having one base mismatch with the target sequence, intended to account for non-specific binding.
MM: intensity value read from the mismatches.
probe pair: a unit composed of a perfect match and its mismatch.
affyID: an identification for a probe set (which can be a gene or a fraction of a gene) represented on the array.
probe pair set: PMs and MMs related to a common affyID.
CEL files: contain measured intensities and locations for an array that has been hybridized.
CDF file: contains the information relating probe pair sets to locations on the array.

Classes
AffyBatch is the main class in the affy package. There are three other auxiliary classes that we also describe in this section.

AffyBatch
The AffyBatch class has slots to keep all the probe level information for a batch of CEL files, which usually represent an experiment. It also stores phenotypic and MIAME information, as does the ExpressionSet class in the Biobase package (the base package for Bioconductor). In fact, AffyBatch extends ExpressionSet. The assayData slot contains a matrix with the columns representing the intensities read from the different arrays. The rows represent the cel intensities for all positions on the array. The cel intensity with physical coordinates (x, y) will be in row $i = x + \mathrm{nrow} \times (y - 1)$. The ncol and nrow slots contain the physical rows of the array. Notice that this is different from the dimensions of the exprs matrix. The number of rows of the exprs matrix is equal to $\mathrm{ncol} \times \mathrm{nrow}$. We advise the use of the functions xy2indices and indices2xy to shuttle from X/Y coordinates to indices. For compatibility with previous versions, the accessor method intensity exists for obtaining the exprs slot. The cdfName slot contains the necessary information for the package to find the locations of the probes for each probe set.
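The row formula above can be illustrated with a small base R sketch. The helper names xy_to_index and index_to_xy are hypothetical and are not the affy functions xy2indices and indices2xy. The sketch assumes 1-based coordinates exactly as in the formula in the text, whereas the affy functions may use a different convention depending on the package version.

```r
# Hypothetical helpers implementing i = x + nrow * (y - 1) with 1-based (x, y).
xy_to_index <- function(x, y, nrow) {
  x + nrow * (y - 1)
}

index_to_xy <- function(i, nrow) {
  y <- (i - 1) %/% nrow + 1   # which block of 'nrow' consecutive indices
  x <- i - nrow * (y - 1)     # position inside that block
  cbind(x = x, y = y)
}

nr <- 640                                   # a 640 x 640 array, as for the estrogen chips
i <- xy_to_index(x = 3, y = 2, nrow = nr)   # 3 + 640 * (2 - 1)
i
index_to_xy(i, nrow = nr)
```

For real AffyBatch objects, prefer the package's own xy2indices and indices2xy, as advised above.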

ProbeSet
The ProbeSet class holds the information of all the probes related to an affyID. The components are pm and mm. The method probeset extracts probe sets from AffyBatch objects. It takes as arguments an AffyBatch object and a vector of affyIDs and returns a list of objects of class ProbeSet.

gn <- geneNames(a)
ps <- probeset(a, gn[1:2])
show(ps[[1]])
ProbeSet object:
  id=100_g_at
  pm= 16 probes x 8 chips
class(ps[[1]])
getSlots("ProbeSet")
ps[[1]]@pm

The pm and mm methods can be used to extract these matrices (see below). This function is general in the way it defines a probe set. The default is to use the definition of a probe set given by Affymetrix in the CDF file.

ps
pm(ps[[1]])
mm(ps[[1]])
pm(ps[[2]])
mm(ps[[2]])

To look at the data set:

ps.orig <- probeset(a, genenames = c("100_g_at", "1000_at"))
ps.orig[[1]]@pm

Other subsetting methods:

pm(a)[1:5, ]
mm(a)[1:5, ]

Background Correction
##bgc will be the background-corrected version of the estrogen data
bgc <- bg.correct(a, method = "rma")
##This plot shows the transformation
plot(pm(a)[, 1], pm(bgc)[, 1], log = "xy", main = "PMs before and after background correction")
bgps <- probeset(bgc, gn[1:2])
cbind(pm(ps[[1]])[, 1], pm(bgps[[1]])[, 1])

Another way to see the effects:

plot(density(pm(bgc)[, 1]))
lines(density(pm(a)[, 1]), col = 2)

or

plot(density(pm(bgc)[, 1]), xlim = c(0, 1000))
lines(density(pm(a)[, 1]), col = 2)

HW #4.a: Perform the background correction on the PM values of the estrogen data and compare the PMs before and after background correction for arrays 1-8. Divide the plot region into a 2x4 grid. What do you observe in the plots?

3.) Normalization.
The most common operation is certainly to convert probe level data to expression values. Typically this is achieved through the following sequence:
1. reading in probe level data.
2. background correction.
3. normalization.
4. probe specific background correction, e.g. subtracting MM.
5. summarizing the probe set values into one expression measure and, in some cases, a standard error for this summary.

We detail what we believe is a good way to proceed below. As mentioned, the function expresso provides many options (see the Biobase package).

#generate the normalized data with median polish summary
x <- expresso(a, bgcorrect.method = "rma", normalize.method = "quantiles", pmcorrect.method = "pmonly", summary.method = "medianpolish")
class(x)
#The same combination of steps is performed by rma(a) in the affy package

a. Use the function expresso to normalize the data and calculate expression values. The function expresso performs the steps background correction, normalization, probe specific correction, and summary value computation. We now show this using an AffyBatch included in the package for examples. The command data(affybatch.example) is used to load these data. Important parameters for the expresso function are:

bgcorrect.method. The background correction method to use. The available methods are
bgcorrect.methods()
[1] "mas" "none" "rma" "rma2"

normalize.method. The normalization method to use. The available methods can be queried by using normalize.methods.
normalize.methods(a)
[1] "constant" "contrasts" "invariantset" "loess"
[5] "methods" "qspline" "quantiles" "quantiles.robust"
[9] "vsn"

pmcorrect.method. The method for probe specific correction. The available methods are
pmcorrect.methods()
[1] "mas" "methods" "pmonly" "subtractmm"

summary.method. The summary method to use. The available methods are
express.summary.stat.methods()
[1] "avgdiff" "liwong" "mas" "medianpolish" "playerout"

#Median Polish
Example: Median Monthly Temperatures

            Jan    Feb    March  April  May    June   July
Caribou     28.9   29.3   33.3   39.5   49.1   58.6   65.2
Washington  40.4   40.9   46.5   54.4   66.1   79.0   85.9
Laredo      57.7   62.6   71.1   83.3   93.5   103.7  108.3

Two-way analysis: common + row + column + residual
How do we estimate the common, row, and column effects robustly?

Get the row medians: 39.5 (Caribou), 54.4 (Washington), 83.3 (Laredo).

Subtract row medians from cells in their row. Get the column medians.

               Row median  Jan    Feb    March  April  May    June   July
Caribou        39.5        -10.6  -10.2  -6.2   0      9.6    19.1   25.7
Washington     54.4        -14.0  -13.5  -7.9   0      11.7   24.6   31.5
Laredo         83.3        -25.6  -20.7  -12.1  0      10.2   20.4   25.0
Column median              -14.0  -13.5  -7.9   0      10.2   20.4   25.7

Subtract column medians from cells in their columns. Get the row medians and use a check mark (v) if the median is zero.

               Row median  Jan    Feb    March  April  May    June   July
Caribou        v           3.4    3.3    1.7    0      -0.6   -1.3   0
Washington     v           0      0      0      0      1.5    4.2    5.8
Laredo         -0.7        -11.6  -7.2   -4.2   0      0      0      -0.7
Column median              -14.0  -13.5  -7.9   0      10.2   20.4   25.7

Get the row residuals. Get the column medians and use a check mark (v) if the median is zero.

               Jan    Feb    March  April  May    June   July
Caribou        3.4    3.3    1.7    0      -0.6   -1.3   0
Washington     0      0      0      0      1.5    4.2    5.8
Laredo         -10.9  -6.5   -3.5   0.7    0.7    0.7    0
Column median  v      v      v      v      0.7    0.7    v

Get the residuals and the row medians. Use a check mark (v) if the median is zero.

               Row median  Jan    Feb    March  April  May    June   July
Caribou        v           3.4    3.3    1.7    0      -1.3   -2.0   0
Washington     v           0      0      0      0      0.8    3.5    5.8
Laredo         v           -10.9  -6.5   -3.5   0.7    0      0      0
Column median              v      v      v      v      v      v      v

Stop if all column medians and row medians are zero.

Putting together
Row median effect: (39.5, 54.4, 83.3) + (0, 0, -0.7) = (39.5, 54.4, 82.6) = 54.4 + (-14.9, 0, 28.2)
Column median effect: (-14.0, -13.5, -7.9, 0, 10.2, 20.4, 25.7) + (0, 0, 0, 0, 0.7, 0.7, 0) = (-14.0, -13.5, -7.9, 0, 10.9, 21.1, 25.7)

R Program
Median polish for matrix x

#initialization
z <- as.matrix(x)
nr <- nrow(z)
nc <- ncol(z)
t <- 0
r <- numeric(nr)
c <- numeric(nc)
oldsum <- 0

#one iteration
rdelta <- apply(z, 1, median)
z <- z - matrix(rdelta, nr = nr, nc = nc)                 #residual matrix
r <- r + rdelta                                           #row median
delta <- median(c)
c <- c - delta
t <- t + delta
cdelta <- apply(z, 2, median)
z <- z - matrix(cdelta, nr = nr, nc = nc, byrow = TRUE)   #residual matrix
c <- c + cdelta                                           #column median
delta <- median(r)
r <- r - delta                                            #adjusted row median
t <- t + delta                                            #overall median
newsum <- sum(abs(z))
abs(newsum - oldsum)
oldsum <- newsum

#converged <- newsum == 0 || abs(newsum - oldsum) < eps * newsum
#if (converged) break
#oldsum <- newsum
ans <- list(overall = t, row = r, col = c, residuals = z)

Putting together as a function

Mpolish <- function(x, eps = 0.001, maxiter = 10) {
  #initialization from above goes here, before the loop
  for (iter in 1:maxiter) {
    #body of the function from above
    converged <- newsum == 0 || abs(newsum - oldsum) < eps * newsum
    if (converged) break
    oldsum <- newsum
  }
  ans <- list(overall = t, row = r, col = c, residuals = z)
  ans
}
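As a cross-check for a hand-written median polish, base R ships medpolish in the stats package. The sketch below applies it to the temperature example, assuming the cities are the rows and the months are the columns (the layout implied by the row medians 39.5, 54.4, 83.3 in the worked example). It should give effects close to those derived by hand. Small differences in individual residuals are possible because the data are entered here exactly as listed in the table.

```r
# Temperature table from the worked example: rows = cities, columns = months.
temps <- matrix(
  c(28.9, 29.3, 33.3, 39.5, 49.1,  58.6,  65.2,
    40.4, 40.9, 46.5, 54.4, 66.1,  79.0,  85.9,
    57.7, 62.6, 71.1, 83.3, 93.5, 103.7, 108.3),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Caribou", "Washington", "Laredo"),
                  c("Jan", "Feb", "March", "April", "May", "June", "July"))
)

# Median polish from the stats package (row medians first, then columns).
fit <- medpolish(temps, trace.iter = FALSE)

fit$overall   # common effect
fit$row       # city effects
fit$col       # month effects

# The decomposition is exact: common + row + column + residual = data.
recon <- fit$overall + outer(fit$row, fit$col, "+") + fit$residuals
all.equal(unname(recon), unname(temps))
```

The same call can be used to check your own function on the first few probe sets: pass it a probes-by-chips matrix of log2 intensities.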

Recall
Suppose we have $j = 1, \dots, J$ probes and $i = 1, \dots, I$ arrays for a given probeset. Fit a robust linear model with probe and chip effects to the log-transformed data:

$$y_{ij} = \mu + \alpha_j + b_i + \varepsilon_{ij}$$

where $y_{ij}$ is the background-corrected PM in log units, $\alpha_j$ is the probe effect, and $b_i$ is the chip effect.

Expression is then (on a log scale) given by $\mu + b_i$.
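The additive model above (overall level plus probe effect plus chip effect plus noise) can be tried on simulated data. This is only a sketch under stated assumptions: the probe set is simulated rather than taken from the estrogen arrays, the numbers (11 probes, 4 chips, noise level) are arbitrary, and stats::medpolish stands in for the robust fit.

```r
set.seed(1)

J <- 11                                 # number of probes (assumed)
I <- 4                                  # number of chips (assumed)
mu    <- 8                              # overall log2 level
alpha <- seq(-1, 1, length.out = J)     # probe effects, median zero
b     <- c(0, 1, 2, 3)                  # chip effects

# Simulated log2 PM values: rows = probes, columns = chips.
y <- mu + outer(alpha, b, "+") + matrix(rnorm(J * I, sd = 0.02), J, I)

# Robust fit of the additive model by median polish.
fit <- medpolish(y, trace.iter = FALSE)

# Expression summary per chip on the log2 scale: overall + chip effect.
expr_hat <- fit$overall + fit$col
expr_hat
diff(expr_hat)   # should be close to the simulated chip differences of 1
```

One value per chip is returned, which is what the median polish summary in expresso produces for each probe set.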

library(affy)
library(estrogen)
library(vsn)
datadir = system.file("extdata", package = "estrogen")
setwd(datadir)
load("workspace.RData")

image.RData includes the expression set x and the AffyBatch a. Then you can continue with the next paragraph.

b. What are other available methods for normalization and expression value calculation? You can consult the vignettes for the affy package for this. Choose another method (for example, MAS5 or RMA) and compare the results. For example, look at scatterplots of the probe set summaries for the same arrays between different methods.

normalize.methods(a)
[1] "constant" "contrasts" "invariantset" "loess"
[5] "qspline" "quantiles" "quantiles.robust" "vsn"
express.summary.stat.methods()
[1] "avgdiff" "liwong" "mas" "medianpolish" "playerout"

Normalize function
norm.example = normalize(a, method = "invariantset")
class(norm.example)   #Still AffyBatch class
normalize.methods(a)
[1] "constant" "contrasts" "invariantset" "loess"
[5] "qspline" "quantiles" "quantiles.robust"

Other methods:

norm.example2 = normalize(a, method = "quantiles")
pm(norm.example2)[1:16, ]
plot(pm(a), pm(norm.example2), pch = ".")
norm.example3 = normalize(a, method = "loess")
plot(pm(a), pm(norm.example3), pch = ".")

Other Bioconductor packages include some related generic functions: 'normalizeWithinArrays' and 'normalizeBetweenArrays' in the LIMMA package, and 'maNorm' in the marrayNorm package.
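To see what the "quantiles" method does in principle, here is a toy version in base R. It is a simplified sketch, not the affy implementation: the function name quantile_normalize is hypothetical, the intensities are simulated, and ties are broken by order of appearance rather than averaged.

```r
# Toy quantile normalization: force every column to share the same distribution.
quantile_normalize <- function(m) {
  ranks  <- apply(m, 2, rank, ties.method = "first")  # rank of each value within its chip
  sorted <- apply(m, 2, sort)                         # each chip sorted increasingly
  ref    <- rowMeans(sorted)                          # reference distribution: mean of the k-th smallest values
  out    <- apply(ranks, 2, function(r) ref[r])       # replace each value by the reference value of its rank
  dimnames(out) <- dimnames(m)
  out
}

set.seed(7)
# Three simulated chips with different overall brightness.
raw <- cbind(chip1 = rexp(1000, rate = 1 / 100),
             chip2 = rexp(1000, rate = 1 / 300),
             chip3 = rexp(1000, rate = 1 / 50))

qn <- quantile_normalize(raw)

# Before: quartiles differ between chips. After: they agree.
apply(raw, 2, quantile, probs = c(0.25, 0.5, 0.75))
apply(qn,  2, quantile, probs = c(0.25, 0.5, 0.75))
```

Within each chip the ordering of the probes is unchanged. Only the values attached to the ranks are replaced.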

Expression values
From raw probe intensities to expression values
The expresso function goes from raw probe intensities to expression values.

x <- expresso(a, bg.correct = FALSE, normalize.method = "vsn", normalize.param = list(subsample = 1000), pmcorrect.method = "pmonly", summary.method = "medianpolish")
#This will store the expression values in the object x, as an object of class ExpressionSet (exprSet in older versions of Biobase)
x
Expression Set (exprSet) with
  12625 genes
  8 samples
  phenoData object with 2 variables and 8 cases
  varLabels
    estrogen: read from file
    time.h: read from file

x <- expresso(a, bg.correct = TRUE, bgcorrect.method = "rma", normalize.method = "quantiles", pmcorrect.method = "pmonly", summary.method = "medianpolish")

Important parameters for the expresso function are:

bgcorrect.method. The background correction method to use. The available methods are
bgcorrect.methods()
[1] "mas" "none" "rma" "rma2"

normalize.method. The normalization method to use. The available methods can be queried by using normalize.methods.
data(affybatch.example)
normalize.methods(affybatch.example)
[1] "constant" "contrasts" "invariantset" "loess"
[5] "qspline" "quantiles" "quantiles.robust"

pmcorrect.method. The method for probe specific correction. The available methods are
pmcorrect.methods()
[1] "mas" "pmonly" "subtractmm"

summary.method. The summary method to use. The available methods are
express.summary.stat.methods()
[1] "avgdiff" "liwong" "mas" "medianpolish" "playerout"

The full argument list is:

expresso(afbatch,
  # background correction
  bg.correct = TRUE, bgcorrect.method = NULL, bgcorrect.param = list(),
  # normalize
  normalize = TRUE, normalize.method = NULL, normalize.param = list(),
  # pm correction
  pmcorrect.method = NULL, pmcorrect.param = list(),
  # expression values
  summary.method = NULL, summary.param = list(), summary.subset = NULL,
  # misc.
  verbose = TRUE, widget = FALSE)

x1 <- expresso(a, bgcorrect.method = "none", normalize = FALSE, pmcorrect.method = "pmonly", summary.method = "medianpolish")
#this will perform the median polish for each of the genes in log2 units
exprs(x1)[1:3, ]

HW #4 b): Build your own median polish function.
HW #4 c): Compare the result above with your own median polish function for the first 3 genes.
HW #4 d): Combine background correction (with rma or mas) and normalization (quantile or loess) with your summary method (median polish or avgdiff). What are the differences? Which do you prefer? Try 3 combinations:
1) rma vs mas for quantile + median polish
2) quantile vs loess for rma + median polish
3) median polish vs avgdiff for rma + quantile

The parameter subsample of the vsn normalization determines the time consumption, as well as the precision of the calibration. The default (if you leave out the parameter normalize.param = list(subsample = 1000)) is 20000; here we chose a smaller value for the sake of demonstration. There is the possibility that expresso does not work properly due to memory problems (normally it should work with 384 MB). In that case, end this session, start a new R session, and reload the libraries and the saved workspace with the library(...) and load("workspace.RData") commands shown in section 3.

5.) Looking at the CEL file images.
The image function allows us to look at the spatial distribution of the intensities on a chip. This can be useful for quality control. Fortunately, none of the 8 CEL files that we have just loaded shows any remarkable spatial artifacts.

image(a[, 1])   #low10-1.cel image of estrogen data

But we have another example:

badc = ReadAffy("bad.cel")
image(badc)

Note that in these images, row 1 is at the bottom, and row 640 at the top.

6.) Histograms.
Another way to visualize what is going on on a chip is to look at the histogram of its intensity distribution. Because of the large dynamic range ($O(10^4)$), it is useful to look at the log-transformed values:

hist(log2(intensity(a[, 4])), breaks = 100, col = "blue")

7.) Boxplot.
To compare the intensity distribution across several chips, we can look at the boxplots, both of the raw intensities a and the normalized probe set values x:

par(mfrow = c(1, 2))
boxplot(a, col = "red")
boxplot(data.frame(exprs(x)), col = "blue")   #x is the expression summary from the normalized, background-corrected PM data

In the commands above, note the different syntax: a is an object of type AffyBatch, and the boxplot function has been programmed to know automatically what to do with it. exprs(x) is an object of type matrix. What happens if you do boxplot(x) or boxplot(exprs(x))?

class(x)
[1] "ExpressionSet"
attr(,"package")
[1] "Biobase"
class(exprs(x))
[1] "matrix"

8.) Scatterplot.
The scatterplot is a visualization that is useful for assessing the variation (or reproducibility, depending on how you look at it) between chips. We can look at all probes, the perfect match probes only, the mismatch probes only, and of course also at the normalized, probeset-summarized data:

par(mfrow = c(2, 2))
plot(exprs(a)[, 1:2], log = "xy", pch = ".", main = "all")
plot(pm(a)[, 1:2], log = "xy", pch = ".", main = "pm")
plot(mm(a)[, 1:2], log = "xy", pch = ".", main = "mm")
plot(exprs(x)[, 1:2], pch = ".", main = "x")

HW #4 d). Why are the arrays that were made at t = 48h much brighter than those at t = 10h? Look at histograms and scatterplots of the probe intensities from chips at 10h and at 48h to see whether you can find any evidence of saturation, changes in experimental protocol, or other quality problems. Distinguish between probes that are supposed to represent genes (you can access these e.g. through the function pm()) and control probes.

9.) Heatmap.
Select the 50 genes with the highest variation (standard deviation) across chips.

rsd <- apply(exprs(x), 1, sd)
sel <- order(rsd, decreasing = TRUE)[1:50]
heatmap(exprs(x)[sel, ], col = gentlecol(256))
?heatmap   #We will cover this topic when we learn clustering and other data mining methods
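The selection step can be tried without the estrogen data. The sketch below uses a simulated expression matrix in which three rows are given a large shift on half of the chips, so they should come out on top. The matrix, the row names, and the use of the base palette heat.colors in place of gentlecol are all assumptions made for illustration.

```r
set.seed(1)

# Simulated expression matrix: 200 probe sets x 8 chips, mostly noise.
expr <- matrix(rnorm(200 * 8, sd = 0.1), nrow = 200,
               dimnames = list(paste0("probe", 1:200), paste0("chip", 1:8)))

# Plant three strongly varying probe sets (shifted on chips 5 to 8).
planted <- c(5, 17, 42)
expr[planted, 5:8] <- expr[planted, 5:8] + 4

# Same selection as above: standard deviation per row, largest first.
rsd <- apply(expr, 1, sd)
sel <- order(rsd, decreasing = TRUE)[1:10]
rownames(expr)[sel[1:3]]   # the planted rows should lead the list

# Heatmap of the selected rows, with a base R palette as a stand-in.
heatmap(expr[sel, ], col = heat.colors(256))
```

With real data, the rows and columns of the heatmap are reordered by hierarchical clustering, which is why replicate chips tend to end up next to each other.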

10.) ANOVA.
Now we can start analyzing our data for biological effects. We set up a linear model with main effects for the level of estrogen (estrogen) and the time (time.h), plus their interaction. Both are factors with 2 levels.

lm.coef = function(y) lm(y ~ estrogen * time.h)$coefficients
eff = esApply(x, 1, lm.coef)

For each gene, we obtain the fitted coefficients for main effects and interaction:

dim(eff)
[1] 4 12625
rownames(eff)
[1] "(Intercept)" "estrogenpresent" "time.h"
[4] "estrogenpresent:time.h"
affyids <- colnames(eff)
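The same per-gene fit can be sketched in base R on simulated data, with apply standing in for esApply. Everything here is an assumption for illustration: the five "genes", the noise level, and the planted estrogen effect on gene1. The design data frame mirrors the 2x2 layout of the eight arrays.

```r
set.seed(42)

# Design of the 8 arrays: estrogen absent/present, measured at 10 h and 48 h.
design <- data.frame(
  estrogen = factor(rep(c("absent", "absent", "present", "present"), times = 2)),
  time.h   = rep(c(10, 48), each = 4)
)

# Simulated log2 expression: 5 genes x 8 arrays, small noise around 8.
expr <- matrix(rnorm(5 * 8, mean = 8, sd = 0.1), nrow = 5,
               dimnames = list(paste0("gene", 1:5), NULL))

# Plant an estrogen effect of 3 log2 units on gene1.
is_present <- design$estrogen == "present"
expr["gene1", is_present] <- expr["gene1", is_present] + 3

# Coefficients of the model with main effects and interaction, one gene at a time.
lm_coef <- function(y) coef(lm(y ~ estrogen * time.h, data = design))
eff <- apply(expr, 1, lm_coef)

dim(eff)        # coefficients x genes
rownames(eff)
names(which.max(eff["estrogenpresent", ]))   # gene with the largest estrogen coefficient
```

Because time.h enters as a number here, the estrogenpresent coefficient is the estrogen effect extrapolated to time zero. Coding time as a factor would change its interpretation.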

Let's bring up the mapping from the vendor's probe set identifier to gene names.

library(hgu95av2)
ls("package:hgu95av2")

Let's now first look at the estrogen main effect, and print the top 3 genes with the largest effect in one direction, as well as in the other direction. Then, look at the estrogen:time interaction.

lowest <- sort(eff[2, ], decreasing = FALSE)[1:3]
mget(names(lowest), hgu95av2GENENAME)
$"37294_at"
[1] "B-cell translocation gene 1, anti-proliferative"
$"36617_at"
[1] "inhibitor of DNA binding 1, dominant negative helix-loop-helix protein"
$"846_s_at"
[1] "BCL2-antagonist/killer 1"

highest <- sort(eff[2, ], decreasing = TRUE)[1:3]
mget(names(highest), hgu95av2GENENAME)
$"910_at"
[1] "thymidine kinase 1, soluble"
$"31798_at"
[1] "trefoil factor 1 (breast cancer, estrogen-inducible sequence expressed in)"
$"40117_at"
[1] "MCM6 minichromosome maintenance deficient 6 (MIS5 homolog, S. pombe) (S. cerevisiae)"

hist(eff[4, ], breaks = 100, col = "blue", main = "estrogen:time interaction")
highia <- sort(eff[4, ], decreasing = TRUE)[1:3]
mget(names(highia), hgu95av2GENENAME)
$"1651_at"
[1] "ubiquitin-conjugating enzyme E2C"
$"40412_at"
[1] "pituitary tumor-transforming 1"
$"1945_at"
[1] "cyclin B1"

About linear model fit

ctl <- c(4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14)
trt <- c(4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69)
group <- gl(2, 10, 20, labels = c("Ctl", "Trt"))
weight <- c(ctl, trt)
lm.D9 <- lm(weight ~ group)
class(lm.D9)
names(lm.D9)
names(summary(lm.D9))
summary(lm.D9)$coefficients
             Estimate  Std. Error  t value    Pr(>|t|)
(Intercept)  5.032     0.2202177   22.850117  9.547128e-15
groupTrt     -0.371    0.3114349   -1.191260  2.490232e-01

dim(summary(lm.D9)$coefficients)
summary(lm.D9)$coefficients[-1, 4]
oeff <- eff[, order(eff[3, ])]   #list eff in order of eff[3, ]
oeff[, 1:10]                     #first 10


HW #4 e). Modify the function lm.coef to produce the p-values and print corresponding P-values for the effects of "estrogenpresent" "time.h" "estrogenpresent:time.h" and calculate the p-values for estrogen data. Find the top 10 genes with significant P-values for each variable.

HW #4 f). Modify the function lm.coef to produce the t-values and print corresponding t-values for the effects of "estrogenpresent" "time.h" "estrogenpresent:time.h" and calculate the t-values for estrogen data. Find the top 10 genes with significant t-values for each variable.
