9512.net
甜梦文库
当前位置:首页 >> >>

Limma linear models for microarray data


limma: Linear Models for Microarray Data User’s Guide
Gordon K. Smyth, Matthew Ritchie, Natalie Thorne and James Wettenhall The Walter and Eliza Hall Institute of Medical Research Melbourne, Australia 29 August 2006

This free open-source software implements academic research by the authors and co-workers. If you use it, please support the project by citing the appropriate journal articles listed in Section 2.1.

Contents
1 Introduction 2 Preliminaries 2.1 Citing limma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2 Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3 How to get help . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 Quick Start 3.1 A brief introduction to R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2 Sample limma Session . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.3 Data Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 Reading Two-Color Data 4.1 Scope of this Chapter . . . 4.2 Recommended Files . . . . 4.3 The Targets Frame . . . . 4.4 Reading in Intensity Data 4.5 Spot Quality Weights . . . 4.6 Reading the Gene List . . 4.7 Printer Layout . . . . . . 4.8 The Spot Types File . . . 5 Data Exploration 6 Pre-Processing Two-Color Data 6.1 Background Correction . . . . . . . . . . 6.2 Within-Array Normalization . . . . . . . 6.3 Between-Array Normalization . . . . . . 6.4 Using Objects from the marray Package 3 4 4 5 6 7 7 8 9 11 11 11 11 13 15 16 16 17 19 20 20 22 24 27 28 28 29 30

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

7 Linear Models Overview 7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.2 A?ymetrix and Other Single-Channel Designs . . . . . . . . . . . . . . . . . . 7.3 Common Reference Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

7.4

Direct Two-Color Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

31 33 33 33 33 34 38 38 41 42 43 46 48 50 50 51 53 53 64 67 70 74 79

8 Speci?c Designs 8.1 Simple Comparisons . . . . . . . 8.1.1 Replicate Arrays . . . . . 8.1.2 Dye Swaps . . . . . . . . . 8.2 Technical Replication . . . . . . . 8.3 Paired Samples . . . . . . . . . . 8.4 Two Groups: Common Reference 8.5 Two Groups: A?ymetrix . . . . . 8.6 Several Groups . . . . . . . . . . 8.7 Factorial Designs . . . . . . . . . 8.8 Time Course Experiments . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

9 Separate Channel Analysis of Two-Color Data 10 Statistics for Di?erential Expression 10.1 Summary Top-Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.2 Fitted Model Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 Case Studies 11.1 Swirl Zebra?sh: A Single-Sample Experiment . . . . . . . . . . . . . . . 11.2 ApoAI Knockout Data: A Two-Sample Experiment . . . . . . . . . . . . 11.3 Ecoli Lrp Data: A?ymetrix Data with Two Targets . . . . . . . . . . . . 11.4 Estrogen Data: A 2x2 Factorial Experiment with A?ymetrix Arrays . . . 11.5 Weaver Mutant Data: A 2x2 Factorial Experiment with Two-Color Data 11.6 Bob Mutant Data: Within-Array Replicate Spots . . . . . . . . . . . . .

. . . . . .

. . . . . .

. . . . . .

2

Chapter 1 Introduction
Limma is a package for the analysis of gene expression microarray data, especially the use of linear models for analysing designed experiments and the assessment of di?erential expression. Limma provides the ability to analyse comparisons between many RNA targets simultaneously. It has features which make the analyses stable even for experiments with small number of arrays—this is achieved by borrowing information across genes. The normalization and exploratory data analysis functions are for two-colour spotted microarrays. The linear model and di?erential expression functions apply to all microarrays including A?ymetrix and other single-channel microarray experiments. This guide gives a tutorial-style introduction to the main limma features but does not describe every feature of the package. A full description of the package is given by the individual function help documents available from the R online help system. To access the online help, type help(package=limma) at the R prompt or else start the html help system using help.start() or the Windows drop-down help menu. The Bioconductor package marray provides alternative functions for reading and normalizing spotted microarray data. The marray package provides ?exible location and scale normalization routines for log-ratios from two-color arrays. The limma package overlaps with marray in functionality but is based on a more general separation between within-array and betweenarray normalization. If you are using limma in conjunction with marray, see Section 6.4. The Bioconductor package a?y provides functions for reading and normalizing A?ymetrix microarray data. Advice on how to use limma with the a?y package is given throughout the User’s Guide, see for example Section 7.2 and the E. coli and estrogen case studies. This guide describes limma as a command-driven package. Packages limmaGUI and a?ylmGUI are also available which provides graphical user interfaces to the most commonly used functions in limma [29]. Both packages are available from Bioconductor or from http://bioinf. wehi.edu.au/limmaGUI. The package limmaGUI is for use with two-color data while a?mGUI is for A?ymetrix data. This user’s guide was prepared using R Version 2.2.0 for Windows and limma version 2.4.1. The limma homepage is http://bioinf.wehi.edu.au/limma.

3

Chapter 2 Preliminaries
2.1 Citing limma

Limma is an implementation of a body of methodological research by the authors and coworkers. To give fair professional credit, please cite the appropriate methodological papers whenever you use results from the limma software in a publication. Such citations are the main means by which the authors receive professional credit for their work. If you use limma for di?erential expression analysis, please cite: Smyth, G. K. (2004). Linear models and empirical Bayes methods for assessing di?erential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology, Vol. 3, No. 1, Article 3. http://www.bepress.com/sagmb/vol3/iss1/art3 The above article describes the linear modeling approach implemented by lmFit and the empirical Bayes statistics implemented by eBayes, topTable etc. If you use limma with duplicate spots or technical replication, please cite Smyth, G. K., Michaud, J., and Scott, H. (2005). The use of within-array replicate spots for assessing di?erential expression in microarray experiments. Bioinformatics 21(9), 2067–2075. http://bioinformatics.oxfordjournals.org/cgi/content/short/21/9/2067 The above article describes the theory behind the duplicateCorrelation function. If you use limma for normalization of two-color microarray data, please cite: Smyth, G. K., and Speed, T. P. (2003). Normalization of cDNA microarray data. Methods 31, 265–273. The above article describes the functions read.maimages, normalizeWithinArrays, normalizeBetweenArrays etc, including the use of spot quality weights. If you use limma to estimate array quality weights, please cite:

4

Ritchie, M. E., Diyagama, D., Neilson, van Laar, R., J., Dobrovic, A., Holloway, A., and Smyth, G. K. (2006). Empirical array quality weights for microarray data. BMC Bioinformatics 7, 261. http://www.biomedcentral.com/1471-2105/7/261 The above article describes the funtions arrayWeights, arrayWeightsSimple etc. The limma software itself can be cited as: Smyth, G. K. (2005). Limma: linear models for microarray data. In: Bioinformatics and Computational Biology Solutions using R and Bioconductor, R. Gentleman, V. Carey, S. Dudoit, R. Irizarry, W. Huber (eds.), Springer, New York, pages 397–420. The above article describes the software package in the context of the Bioconductor project and surveys the range of experimental designs for which the package can be used, including spot-speci?c dye-e?ects. The pre-processing capabilities of the package are also described but more brie?y, with examples of background correction, spot quality weights and ?ltering with control spots. This article is also the best current reference for the normexp background correction method. Finally, if you are using one of the menu-driven interfaces to the software, please cite the appropriate one of Wettenhall, J. M., and Smyth, G. K. (2004). limmaGUI: a graphical user interface for linear modeling of microarray data. Bioinformatics, 20, 3705–3706. Wettenhall, J. M., Simpson, K. M., Satterley, K., and Smyth, G. K. (2006). a?ylmGUI: a graphical user interface for linear modeling of single channel microarray data. Bioinformatics 22, 897–899.

2.2

Installation

Limma is a package for the R computing environment and it is assumed that you have already installed R. See the R project at http://www.r-project.org. Installing from CRAN. Limma is available as a contributed package from the R Project CRAN site. If you are using R on a system with a suitable internet connection and with installation privileges on your computer, you should be able to install it via
> install.packages("limma")

at the R prompt from an internet-connected computer. If you are using Windows, use the drop-down menu Packages Install package(s) from CRAN . . .. Installing from Bioconductor. Limma is available as part of the Bioconductor project at http://www.bioconductor.org. Bioconductor works on a 6-monthly o?cial release cycle, lagging each major R release by a few weeks. This means that Bioconductor software is 5

updated only once every six months, unless you are using the developmental version of R. Updates of limma between the Bioconductor o?cial releases can be obtained from CRAN. Change-log. Limma is updated frequently, often a couple of times a week. Once you have installed limma, the change-log can also be viewed from the R prompt. To see the most recent 20 lines type:
> changeLog(n=20)

2.3

How to get help

Most questions about limma will hopefully be answered by the documentation or references. If you’ve run into a question which isn’t addressed by the documentation, or you’ve found a con?ict between the documentation and software itself, then there is an active support community which can o?er help. The authors of the package always appreciate receiving reports of bugs in the package functions or in the documentation. The same goes for well-considered suggestions for improvements. Any other questions or problems concerning limma should be sent to the Bioconductor mailing list bioconductor@stat.math.ethz.ch. To subscribe to the mailing list, see https://stat.ethz.ch/mailman/listinfo/bioconductor. Please send requests for general assistance and advice to the mailing list rather than to the individual authors. Users posting to the mailing list for the ?rst time should read the helpful posting guide at http: //www.bioconductor.org/doc/postingGuide.html. Note that each function in limma has it’s own online help page, as described in the next section. Mailing list etiquette requires that you read the relevant help page carefully before posting a problem to the list.

6

Chapter 3 Quick Start
3.1 A brief introduction to R

R is a program for statistical computing. It is a command-driven language meaning that you have to type commands into it rather than pointing and clicking using a mouse. In this guide it will be assumed that you have successfully downloaded and installed R from http://www.r-project.org. A good way to get started is to type
> help.start()

at the R prompt or, if you’re using R for Windows, to follow the drop-down menu items Help Html help. Following the links Packages limma from the html help page will lead you to the contents page of help topics for functions in limma. Before you can use any limma commands you have to load the package by typing
> library(limma)

at the R prompt. You can get help on any function in any loaded package by typing ? and the function name at the R prompt, for example
> ?read.maimages

or equivalently
> help("read.maimages")

for detailed help on the read.maimages function. The individual function help pages are especially important for listing all the arguments which a function will accept and what values the arguments can take. A key to understanding R is to appreciate that anything that you create in R is an “object”. Objects might include data sets, variables, functions, anything at all. For example
> x <- 2

will create a variable x and will assign it the value 2. At any stage of your R session you can type 7

> objects()

to get a list of all the objects you have created. You see show the contents of any object by typing the name of the object at the prompt, for example either of the following commands will print out the contents of x:
> show(x) > x

We hope that you can use limma without having to spend a lot of time learning about the R language itself but a little knowledge in this direction will be very helpful, especially when you want to do something not explicitly provided for in limma or in the other Bioconductor packages. For more details about the R language see An Introduction to R which is available from the online help. For more background on using R for statistical analyses see [5].

3.2

Sample limma Session

This is a quick overview of what an analysis might look like. The ?rst example assumes four replicate two-color arrays, the second and fourth of which are dye-swapped. We assume that the images have been analyzed using GenePix to produce a .gpr ?le for each array and that a targets ?le targets.txt has been prepared with a column containing the names of the .gpr ?les.
> library(limma) > targets <- readTargets("targets.txt")

Set up a ?lter so that any spot with a ?ag of ?99 or less gets zero weight.
> f <- function(x) as.numeric(x$Flags > -99)

Read in the data.
> RG <- read.maimages(targets$FileName, source="genepix", wt.fun=f)

The following command implements a type of adaptive background correction. This is optional but recommended for GenePix data.
> RG <- backgroundCorrect(RG, method="rma")

Print-tip loess normalization:
> MA <- normalizeWithinArrays(RG)

Estimate the fold changes and standard errors by ?tting a linear model for each gene. The design matrix indicates which arrays are dye-swaps.
> fit <- lmFit(MA, design=c(-1,1,-1,1))

Apply empirical Bayes smoothing to the standard errors. 8

> fit <- eBayes(fit)

Show statistics for the top 10 genes.
> topTable(fit)

The second example assumes A?ymetrix arrays hybridized with either wild-type (wt) or mutant (mt) RNA. There should be three or more arrays in total to ensure some replication. The targets ?le is now assumed to have another column Genotype indicating which RNA source was hybridized on each array.
> library(affy) > library(limma) > targets <- readTargets("targets.txt")

Read and pre-process the A?ymetrix CEL ?le data.
> ab <- ReadAffy(filenames=targets$FileName) > eset <- rma(ab)

Form an appropriate design matrix for the two RNA sources and ?t linear models. The design matrix has two columns. The ?rst represents log-expression in the wild-type and the second represents the log-ratio between the mutant and wild-type samples. See Section 8.5 for more details on the design matrix.
> > > > design <- cbind(WT=1, MUvsWT=targets$Genotype=="mu") fit <- lmFit(eset, design) fit <- eBayes(fit) topTable(fit, coef="MUvsWT")

This code ?ts the linear model, smooths the standard errors and displays the top 10 genes for the mutant versus wild-type comparison.

3.3
RGList.

Data Objects
Red-Green list. A class used to store raw intensities as they are read in from an image analysis output ?le, usually by read.maimages(). Intensities converted to M-values and A-values, i.e., to within-spot and wholespot contrasts on the log-scale. Usually created from an RGList using MA.RG() or normalizeWithinArrays(). Objects of this class contain one row for each spot. There may be more than one spot and therefore more than one row for each probe.

There are four main types of data objects created and used in limma:

MAList.

MArrayLM. Store the result of ?tting gene-wise linear models to the normalized intensities or log-ratios. Usually created by lmFit(). Objects of this class normally contain one row

for each unique probe. 9

TestResults.

Store the results of testing a set of contrasts equal to zero for each probe. Usually created by decideTests(). Objects of this class normally contain one row for each unique probe.

For those who are familiar with matrices in R, all these objects are designed to obey many analogies with matrices. In the case of RGList and MAList, rows correspond to spots and columns to arrays. In the case of MarrayLM, rows correspond to unique probes and columns to parameters or contrasts. The functions summary, dim, length, ncol, nrow, dimnames, rownames, colnames have methods for these classes. For example
> dim(RG) [1] 11088 4

shows that the RGList object RG contains data for 11088 spots and 4 arrays.
> colnames(RG)

will give the names of the ?lenames or arrays in the object, while if fit is an MArrayLM object then
> colnames(fit)

would give the names of the coe?cients in the linear model ?t. Objects of any of these classes may be subsetted, so that RG[,j] means the data for array j and RG[i,] means the data for probes indicated by the index i. Multiple data objects may be combined using cbind, rbind or merge. Hence
> RG1 <- read.maimages(files[1:2], source="genepix") > RG2 <- read.maimages(files[3:5], source="genepix") > RG <- cbind(RG1, RG2)

is equivalent to
> RG <- read.maimages(files[1:5], source="genepix")

Alternatively, if control status has been set in the an MAList object then
> i <- MA$genes$Status=="Gene" > MA[i,]

might be used to eliminate control spots from the data object prior to ?tting a linear model.

10

Chapter 4 Reading Two-Color Data
4.1 Scope of this Chapter

This chapter is for two-color arrays. If you are using A?ymetrix arrays, you should use the a?y or a?yPLM packages to read and normalize the data. If you have single channel arrays other than A?ymetrix, you will need to the read the intensity data into your R session yourself using the basic R read functions such as read.table. You will need to create a matrix containing the log-intensities with rows for probes and columns for arrays.

4.2

Recommended Files

We assume that an experiment has been conducted with one or more microarrays, all printed with the same library of probes. Each array has been scanned to produce a TIFF image. The TIFF images have then been processed using an image analysis program such a ArrayVision, ImaGene, GenePix, QuantArray or SPOT to acquire the red and green foreground and background intensities for each spot. The spot intensities have then been exported from the image analysis program into a series of text ?les. There should be one ?le for each array or, in the case of Imagene, two ?les for each array. You will need to have the image analysis output ?les. In most cases these ?les will include the IDs and names of the probes and possibly other annotation information. A few image analysis programs, for example SPOT, do not write the probe IDs into the output ?les. In this case you will also need a genelist ?le which describes the probes. It most cases it is also desirable to have a targets ?le which describes which RNA sample was hybridized to each channel of each array. A further optional ?le is the spot types ?le which identi?es special probes such as control spots.

4.3

The Targets Frame

The ?rst step in preparing data for input into limma is usually to create a targets ?le which lists the RNA target hybridized to each channel of each array. It is normally in tab-delimited 11

text format and should contain a row for each microarray in the experiment. The ?le can have any name but the default is Targets.txt. If it has the default name, it can be read into the R session using
> targets <- readTargets()

Once read into R, it becomes the targets frame. The targets frame normally contains a FileName column, giving the name of the imageanalysis output ?le, a Cy3 column giving the RNA type labelled with Cy3 dye for that slide and a Cy5 column giving the RNA type labelled with Cy5 dye for that slide. Other columns are optional. The targets ?le can be prepared using any text editor but spreadsheet programs such as Microsoft Excel are convenient. The targets ?le for the Swirl case study includes optional SideNumber and Date columns:

It is often convenient to create short readable labels to associate with each array for use in output and in plots, especially if the ?le names are long or non-intuitive. A column containing these labels can be included in the targets ?le, for example the Name column used for the ApoAI case study:

12

This column can be used to created row names for the targets frame by
> targets <- readTargets("targets.txt", row.names="Name")

The row names can be propagated to become array names in the data objects when these are read in. For ImaGene ?les, the FileName column is split into a FileNameCy3 column and a FileNameCy5 because ImaGene stores red and green intensities in separate ?les. This is a short example:

4.4

Reading in Intensity Data

Let files be a character vector containing the names of the image analysis output ?les. The foreground and background intensities can be read into an RGList object using a command of the form
> RG <- read.maimages(files, source="<imageanalysisprogram>", path="<directory>")

where <imageanalysisprogram> is the name of the image analysis program and <directory> is the full path of the directory containing the ?les. If the ?les are in the current R working directory then the argument path can be omitted; see the help entry for setwd for how to set the current working directory. The ?le names are usually read from the Targets File. For example, the Targets File Targets.txt is in the current working directory together with the SPOT output ?les, then one might use
> targets <- readTargets() > RG <- read.maimages(targets$FileName, source="spot")

Alternatively, and even more simply, one may give the targets frame itself in place of the files argument as
> RG <- read.maimages(targets, source="spot")

In this case the software will look for the column FileName in the targets frame. If the ?les are GenePix output ?les then they might be read using
> RG <- read.maimages(targets, source="genepix")

13

given an appropriate targets ?le. Consult the help entry for read.maimages to see which other image analysis programs are supported. Files are assumed by default to be tab-delimited, although other separators can be speci?ed using the sep= argument. Reading data from ImaGene software is a little di?erent to that of other image analysis programs because the red and green intensities are stored in separate ?les. This means that the targets frame should include two ?lename columns called, say, FileNameCy3 and FileNameCy5, giving the names of the ?les containing the green and red intensities respectively. An example is given in Section 4.3. Typical code with ImaGene data might be
> targets <- readTargets() > files <- targets[,c("FileNameCy3","FileNameCy5")] > RG <- read.maimages(files, source="imagene")

For ImaGene data, the files argument to read.maimages() is expected to be a 2-column matrix of ?lenames rather than a vector. The following table gives the default estimates used for the foreground and backgound intensities: Source agilent bluefuse genepix genepix.median genepix.custom imagene quantarray scanarrayexpress smd.old smd spot spot.close.open Foreground Mean Signal AMPCH F Mean F Median Mean Signal Mean Background Median Signal None B Median B Median B Signal Median, or Signal Mean if auto segmentation has been used Intensity Background Mean Median B MEDIAN I MEAN Intensity (Mean) Background (Median) mean morph mean morph.close.open

The default estimates can be over-ridden by specifying the columns argument to read.maimages(). Suppose for example that GenePix has been used with a custom background method, and you wish to use median foregound estimates. This combination of foreground and background is not provided as a pre-set choice in limma, but you can specify it by
> RG <- read.maimages(files,source="genepix", + columns=list(R="F635 Median",G="F532 Median",Rb="B635",Gb="B532"))

What should you do if your image analysis program is not in the above list? If the image output ?les are in standard format, then you can supply the annotation and intensity column names yourself. For example,
> RG <- read.maimages(files, + columns=list(R="F635 Mean",G="F532 Mean",Rb="B635 Median",Gb="B532 Median"), + annotation=c("Block","Row","Column","ID","Name"))

14

is exactly equivalent to source="genepix". “Standard format” means here that there is a unique column name identifying each column of interest and that there are no lines in the ?le following the last line of data. Header information at the start of the ?le is acceptable, but extra lines at the end of the ?le will cause the read to fail. It is a good idea to look at your data to check that it has been read in correctly. Type
> show(RG)

to see a print out of the ?rst few lines of data. Also try
> summary(RG$R)

to see a ?ve-number summary of the red intensities for each array, and so on. It is possible to read the data in several steps. If RG1 and RG2 are two data sets corresponding to di?erent sets of arrays then
> RG <- cbind(RG1, RG2)

will combine them into one large data set. Data sets can also be subsetted. For example RG[,1] is the data for the ?rst array while RG[1:100,] is the data on the ?rst 100 genes.

4.5

Spot Quality Weights

It is desirable to use the image analysis output to compute a weight for each spot between 0 and 1 which indicates the reliability of the acquired intensities for that spot. For example, if the SPOT image analysis program is used and the size of an ideal perfectly circular spot is known to be 100 pixels, then one might use
> RG <- read.maimages(files,source="spot",wt.fun=wtarea(100))

The function wtarea(100) gives full weight to spots with area 100 pixels and down-weights smaller and larger spots. Spots which have zero area or are more than twice the ideal size are given zero weight. This will create a component called weights in the RG-list. The weights will be used automatically by functions such as normalizeWithinArrays which operate on the RG-list. With GenePix data
> RG <- read.maimages(files,source="genepix",wt.fun=wtflags(0.1))

will give weight 0.1 to any spot which receives a negative ?ag from the GenePix program. The appropriate way to computing spot quality weights depends on the image analysis program that you are used. Consult the help entry QualityWeights to see what quality weight functions are available. The wt.fun argument is very ?exible and allows you to construct your own weights. The wt.fun argument can be any function which takes a data set as argument and computes the desired weights. For example, if you wish to give zero weight to all Genepix ?ags less than -50 you could use
> myfun <- function(x) as.numeric(x$Flags > -50.5) > RG <- read.maimages(files, source="genepix", wt.fun=myfun)

15

The wt.fun facility can be used to compute weights based on any number of columns in the image analysis ?les. For example, some researchers like to ?lter out spots if the foreground mean and median from GenePix for a given spot di?er by more than a certain threshold, say 50. This could be achieved by
> myfun <- function(x, threshold=50) { + okred <- abs(x[,"F635 Median"]-x[,"F635 Mean"]) < threshold + okgreen <- abs(x[,"F532 Median"]-x[,"F532 Mean"]) < threshold + as.numeric(okgreen & okred) +} > RG <- read.maimages(files, source="genepix", wt.fun=myfun)

Then all the “bad” spots will get weight zero which, in limma, is equivalent to ?agging them out. The de?nition of myfun here could be replaced with any other code to compute weights using the columns in the GenePix output ?les.

4.6

Reading the Gene List

The RGList read by read.maimages() will almost always contain a component called genes containing the IDs and other annotation information associated with the probes. The only exceptions are SPOT data, source="spot", or when reading generic data, source="generic", without setting the annotation argument, annotation=NULL. Try
> names(RG$genes)

to see if the genes component has been set. If the genes component is not set, the probe IDs will need to be read from a separate ?le. If the arrays have been scanned with an Axon scanner, then the probes IDs will be available in a tab-delimited GenePix Array List (GAL) ?le. If the GAL ?le has extension “gal” and is in the current working directory, then it may be read into a data.frame by
> RG$genes <- readGAL()

Non-Genepix gene lists can be read into R using the function read.delim from R base.

4.7

Printer Layout

The printer layout is the arrangement of spots and blocks of spots on the arrays. The blocks are sometimes called print-tip groups or pin-groups or meta rows and columns. Each block corresponds to a print tip on the print-head used to print the arrays, and the layout of the blocks on the arrays corresponds to the layout of the tips on the print-head. The number of spots in each block is the number of times the print-head was lowered onto the array. Where possible, for example for Agilent, GenePix or ImaGene data, read.maimages will set the printer layout information in the component printer. Try
> names(RG$printer)

16

to see if the printer layout information has been set. If you’ve used readGAL to set the genes component, you may also use getLayout to set the printer information by
> RG$printer <- getLayout(RG$genes)

Note this will work only for GenePix GAL ?les, not for general gene lists.

4.8

The Spot Types File

The Spot Types ?le (STF) is another optional tab-delimited text ?le which allows you to identify di?erent types of spots from the gene list. The STF is used to set the control status of each spot on the arrays so that plots may highlight di?erent types of spots in an appropriate way. It is typically used to distinguish control spots from those corresponding to genes of interest and to distinguish positive from negative controls, ratio from calibration controls and so on. The STF should have a SpotType column giving the names of the di?erent spot-types. One or more other columns should have the same names as columns in the gene list and should contain patterns or regular expressions su?cient to identify the spot-type. Any other columns are assumed to contain plotting attributes, such as colors or symbols, to be associated with the spot-types. There is one row for each spot-type to be distinguished. The STF uses simpli?ed regular expressions to match patterns. For example, AA* means any string starting with AA, *AA means any code ending with AA, AA means exactly these two letters, *AA* means any string containing AA, AA. means AA followed by exactly one other character and AA\. means exactly AA followed by a period and no other characters. For those familiar with regular expressions, any other regular expressions are allowed but the codes ^ for beginning of string and $ for end of string should be excluded. Note that the patterns are matched sequentially from ?rst to last, so more general patterns should be included ?rst. The ?rst row should specify the default spot-type and should have pattern * for all the patternmatching columns. Here is a short STF appropriate for the ApoAI data:

17

In this example, the columns ID and Name are found in the gene-list and contain patterns to match. The asterisks are wildcards which can represent anything. Be careful to use upper or lower case as appropriate and don’t insert any extra spaces. The remaining column gives colors to be associated with the di?erent types of points. This code assumes of that the probe annotation data.frame includes columns ID and Name. This is usually so if GenePix has been used for the image analysis, but other image analysis software may use other column names. Here is a STF below appropriate for arrays with Lucidea Universal ScoreCard control spots.

If the STF has default name SpotTypes.txt then it can be read using
> spottypes <- readSpotTypes()

It is typically used as an argument to the controlStatus() function to set the status of each spot on the array, for example
> RG$genes$Status <- controlStatus(spottypes, RG)

18

Chapter 5 Data Exploration
It is advisable to display your data in various ways as a quality check and to check for unexpected e?ects. We recommend an imageplot of the raw log-ratios and an MA-plot of the raw data at least for each array as routine quality assessment displays. See the Swirl case study for some examples. The functions imageplot3by2 and plotMA3by2 can be used to automate the production of plots for all arrays in an experiment. The following is an example MA-Plot for an Incyte array with various spike-in and other controls. (Data courtesy of Rebecca McCracken and Steve Gerondakis, Walter and Eliza Hall Institute of Medical Research.) The plot was produced using
> spottypes <- readSpotTypes() > RG$genes$Status <- controlStatus(spottypes, RG) > plotMA(RG)

The array includes spike-in ratio controls which are 3-fold, 10-fold and 25-fold up and down regulated, as well as non-di?erentially expressed sensitivity controls and negative controls. 19

Chapter 6 Pre-Processing Two-Color Data
6.1 Background Correction

The default background correction action is to subtract the background intensity from the foreground intensity for each spot. If the RGList object has not already been background corrected, then normalizeWithinArrays will do this by default. Hence
> MA <- normalizeWithinArrays(RG)

is equivalent to
> RGb <- backgroundCorrect(RG, method="subtract") > MA <- normalizeWithinArrays(RGb)

However there are many other background correction options which may be preferable in certain situations. For the purpose of assessing di?erential expression, we often ?nd
> RG <- backgroundCorrect(RG, method="normexp", offset=50)

to be preferable to the simple background subtraction when using output from most image analysis programs. This method adjusts the foreground adaptively for the background intensities and results in strictly positive adjusted intensities, i.e., negative or zero corrected intensities are avoided. The use of an o?set damps the variation of the log-ratios for very low intensities spots towards zero. To illustrate some di?erences between the di?erent background correction methods we consider one cDNA array which was self-self hybridized, i.e., the same RNA source was hybridized to both channels. For this array there is no actual di?erential expression. The array was printed with a human 10.5k library and hybridized with Jurkatt RNA on both channels. (Data courtesy Andrew Holloway and Dileepa Diyagama, Peter MacCallum Cancer Centre, Melbourne.) The array included a selection of control spots which are highlighted on the plots. Of particular interest are the spike-in ratio controls which should show up and down fold changes of 3 and 10. The ?rst plot displays data acquired with GenePix software and background corrected by subtracting the median local background, which is the default with 20

GenePix data. The plot shows the typical wedge shape with fanning of the M-values at low intensities. The range of observed M-values dominates the spike-in ratio controls. The are also 1148 spots not shown on the plot because the background corrected intensities were zero or negative.

The second plot shows the same array background corrected with method="normexp" and offset=50. The spike-in ratio controls now standout clearly from the range of the M-values. All spots on the array are shown on the plot because there are now no missing M-values.

The third plot shows the same array quanti?ed with SPOT software and with “morph” background subtracted. This background estimator produces a similar e?ect to that with normexp. 21

The e?ect of using “morph” background or using method="normexp" with an o?set is to stabilize the variability of the M-values as a function of intensity. The empirical Bayes methods implemented in the limma package for assessing di?erential expression will yield most bene?t when the variabilities are as homogenous as possible between genes. This can best be achieved by reducing the dependence of variability on intensity as far as possible.

6.2

Within-Array Normalization

Limma implements a range of normalization methods for spotted microarrays. Smyth and Speed [23] describe some of the most commonly used methods. The methods may be broadly classi?ed into methods which normalize the M-values for each array separately (within-array normalization) and methods which normalize intensities or log-ratios to be comparable across arrays (between-array normalization). This section discusses mainly within-array normalization, which all that is usually required for the traditional log-ratio analysis of two-color data. Between-array normalization is discussed further in Section 6.3. Print-tip loess normalization [33] is the default normalization method and can be performed by
> MA <- normalizeWithinArrays(RG)

There are some notable cases where this is not appropriate. For example, Agilent arrays do not have print-tip groups, so one should use global loess normalization instead:
> MA <- normalizeWithinArrays(RG, method="loess")

Print-tip loess is also unreliable for small arrays with less than, say, 150 spots per print-tip group. Even larger arrays may have particular print-tip groups which are too small for printtip loess normalization if the number of spots with non-missing M-values is small for one or more of the print-tip groups. In these cases one should either use global "loess" normalization or else use robust spline normalization 22

> MA <- normalizeWithinArrays(RG, method="robustspline")

which is an empirical Bayes compromise between print-tip and global loess normalization, with 5-parameter regression splines used in place of the loess curves. Loess normalization assumes that the bulk of the probes on the array are not di?erentially expressed. It doesn’t assume that that there are equal numbers of up and down regulated genes or that di?erential expression is symmetric about zero, provided that the loess ?t is implemented in a robust fashion, but it is necessary that there be a substantial body of probes which do not change expression levels. This assumption can be suspect for boutique arrays where the total number of unique genes on the array is small, say less than 150, particularly if these genes have been selected for being speci?cally expressed in one of the RNA sources. In such a situation, the best strategy is to include on the arrays a series of non-di?erentially expressed control spots, such as a titration series of whole-library-pool spots, and to use the up-weighting method discussed below. A whole-library-pool means that one makes a pool of a library of probes, and prints spots from the pool at various concentrations [32]. The library should be su?ciently large than one can be con?dent that the average of all the probes is not di?erentially expressed. The larger the library the better. Good results have been obtained with library pools with as few as 500 clones. In the absence of such control spots, normalization of boutique arrays requires specialist advice. Any spot quality weights found in RG will be used in the normalization by default. This means for example that spots with zero weight (?agged out) will not in?uence the normalization of other spots. The use of spot quality weights will not however result in any spots being removed from the data object. Even spots with zero weight will be normalized and will appear in the output object, such spots will simply not have any in?uence on the other spots. If you do not wish the spot quality weights to be used in the normalization, their use can be over-ridden using
> MA <- normalizeWithinArrays(RG, weights=NULL)

The output object MA will still contain any spot quality weights found in RG, but these weights are not used in the normalization step. It is often useful to make use of control spots to assist the normalization process. For example, if the arrays contain a series of spots which are known in advance to be non-di?erentially expressed, these spots can be given more weight in the normalization process. Spots which are known in advance to be di?erentially expressed can be down-weighted. Suppose for example that the controlStatus() has been used to identify spike-in spots which are di?erentially expressed and a titration series of whole-library-pool spots which should not be di?erentially expressed. Then one might use
> w <- modifyWeights(RG$weights, RG$genes$Status, c("spikein","titration"), c(0,2)) > MA <- normalizeWithinArrays(RG, weights=w)

to give zero weight to the spike-in spots and double weight to the titration spots. The idea of up-weighting the titration spots is in the same spirit as the composite normalization method proposed by [32] but is more ?exible and generally applicable. The above code assumes that RG already contains spot quality weights. If not, one could use 23

> w <- modifyWeights(array(1,dim(RG)), RG$genes$Status, c("spikein","titration"), c(0,2)) > MA <- normalizeWithinArrays(RG, weights=w)

instead. Limma contains some more sophisticated normalization methods. In particular, some between-array normalization methods are discussed in Section 6.3 of this guide.

6.3

Between-Array Normalization

This section explores some of the methods available for between-array normalization of twocolor arrays. A feature which distinguishes most of these methods from within-array normalization is the focus on the individual red and green intensity values rather than merely on the log-ratios. These methods might therefore be called individual channel or separate channel normalization methods. Individual channel normalization is typically a prerequisite to individual channel analysis methods such as that provided by lmscFit(). Further discussion of the issues involved is given by [35]. This section shows how to reproduce some of the results given in [35]. The ApoAI data set from Section 11.2 will be used to illustrate these methods. We assume that the the ApoAI data has been loaded and background corrected as follows:
> load("ApoAI.RData")

An important issue to consider before normalizing between arrays is how background correction has been handled. For between-array normalization to be e?ective, it is important to avoid missing values in log-ratios which might arise from negative or zero corrected intensities. The function backgroundCorrect() gives a number of useful options. For the purposes of this section, the data has been corrected using the "minimum" method:
> RG.b <- backgroundCorrect(RG,method="minimum")

plotDensities displays smoothed empirical densities for the individual green and red chan-

nels on all the arrays. Without any normalization there is considerable variation between both channels and between arrays:
> plotDensities(RG.b)

24

After loess normalization of the M-values for each array the red and green distributions become essentially the same for each array, although there is still considerable variation between arrays:
> MA.p <-normalizeWithinArrays(RG.b) > plotDensities(MA.p)

Loess normalization doesn’t a?ect the A-values. Applying quantile normalization to the A-values makes the distributions essentially the same across arrays as well as channels: 25

> MA.pAq <- normalizeBetweenArrays(MA.p, method="Aquantile") > plotDensities(MA.pAq)

Applying quantile normalization directly to the individual red and green intensities produces a similar result but is somewhat noisier:
> MA.q <- normalizeBetweenArrays(RG.b, method="quantile") > plotDensities(MA.q, col="black") Warning message: number of groups=2 not equal to number of col in: plotDensities(MA.q, col = "black")

26

There are other between-array normalization methods not explored here. For example
normalizeBetweenArrays with method="vsn" gives an interface to the variance-stabilizing nor-

malization methods of the vsn package.

6.4

Using Objects from the marray Package

The package marray is a well known R package for pre-processing of two-color microarray data. Marray provides functions for reading, normalization and graphical display of data. Marray and limma are both descendants of the earlier and path-breaking SMA package available from http://www.stat.berkeley.edu/users/terry/zarray/Software/smacode.html but limma has maintained and built upon the original data structures whereas marray has converted to a fully formal data class representation. For this reason, Limma is backwardly compatible with SMA while marray is not. Normalization functions in marray focus on a ?exible approach to location and scale normalization of M-values, rather than the within and between-array approach of limma. Marray provides some normalization methods which are not in limma including 2-D loess normalization and print-tip-scale normalization. Although there is some overlap between the normalization functions in the two packages, both providing print-tip loess normalization, the two approaches are largely complementary. Marray also provides highly developed functions for graphical display of two-color microarray data. Read functions in marray produce objects of class marrayRaw while normalization produces objects of class marrayNorm. Objects of these classes may be converted to and from limma data objects using the convert package. marrayRaw objects may be converted to RGList objects and marrayNorm objects to MAList objects using the as function. For example, if Data is an marrayNorm object then
> library(convert) > MA <- as(Data, "MAList")

converts to an MAList object. marrayNorm objects can also be used directly in limma without conversion, and this is generally recommended. If Data is an marrayNorm object, then
> fit <- lmFit(Data, design)

?ts a linear model to Data as it would to an MAList object. One di?erence however is that the marray read functions tend to populate the maW slot of the marrayNorm object with qualitative spot quality ?ags rather than with quantitative non-negative weights, as expected by limma. If this is so then one may need
> fit <- lmFit(Data, design, weights=NULL)

to turn o? use of the spot quality weights.

27

Chapter 7 Linear Models Overview
7.1 Introduction

The package limma uses an approach called linear models to analyse designed microarray experiments. This approach allows very general experiments to be analysed just as easily as a simple replicated experiment. The approach is outlined in [25, 34]. The approach requires one or two matrices to be speci?ed. The ?rst is the design matrix which indicates in e?ect which RNA samples have been applied to each array. The second is the contrast matrix which speci?es which comparisons you would like to make between the RNA samples. For very simple experiments, you may not need to specify the contrast matrix. The philosophy of the approach is as follows. You have to start by ?tting a linear model to your data which fully models the systematic part of your data. The model is speci?ed by the design matrix. Each row of the design matrix corresponds to an array in your experiment and each column corresponds to a coe?cient which is used to describe the RNA sources in your experiment. With A?ymetrix or single-channel data, or with two-color with a common reference, you will need as many coe?cients as you have distinct RNA sources, no more and no less. With direct-design two-color data you will need one fewer coe?cient than you have distinct RNA sources, unless you wish to estimate a dye-e?ect for each gene, in which case the number of RNA sources and the number of coe?cients will be the same. Any set of independent coe?cients will do, providing they describe all your treatments. The main purpose of this step is to estimate the variability in the data, hence the systematic part needs to modelled so it can be distinguished from random variation. In practice the requirement to have exactly as many coe?cients as RNA sources is too restrictive in terms of questions you might want to answer. You might be interested in more or fewer comparisons between the RNA source. Hence the contrasts step is provided so that you can take the initial coe?cients and compare them in as many ways as you want to answer any questions you might have, regardless of how many or how few these might be. If you have data from A?ymetrix experiments, from single-channel spotted microarrays or from spotted microarrays using a common reference, then linear modeling is the same as ordinary analysis of variance or multiple regression except that a model is ?tted for every gene. With data of this type you can create design matrices as one would do for ordinary 28

modeling with univariate data. If you have data from spotted microarrays using a direct design, i.e., a connected design with no common reference, then the linear modeling approach is very powerful but the creation of the design matrix may require more statistical knowledge. For statistical analysis and assessing di?erential expression, limma uses an empirical Bayes method to moderate the standard errors of the estimated log-fold changes. This results in more stable inference and improved power, especially for experiments with small numbers of arrays [25]. For arrays with within-array replicate spots, limma uses a pooled correlation method to make full use of the duplicate spots [22].

7.2

A?ymetrix and Other Single-Channel Designs

A?ymetrix data will usually be normalized using the a?y package. We will assume here that the data is available as an exprSet object called eset. Such an object will have an slot containing the log-expression values for each gene on each array which can be extracted using exprs(eset). A?ymetrix and other single-channel microarray data may be analysed very much like ordinary linear models or anova models. The di?erence with microarray data is that it is almost always necessary to extract particular contrasts of interest and so the standard parametrizations provided for factors in R are not usually adequate. There are many ways to approach the analysis of a complex experiment in limma. A straightforward strategy is to set up the simplest possible design matrix and then to extract from the ?t the contrasts of interest. Suppose that there are three RNA sources to be compared. Suppose that the ?rst three arrays are hybridized with RNA1, the next two with RNA2 and the next three with RNA3. Suppose that all pair-wise comparisons between the RNA sources are of interest. We assume that the data has been normalized and stored in an exprSet object, for example by
> data <- ReadAffy() > eset <- rma(data)

An appropriate design matrix can be created and a linear model ?tted using
> design <- model.matrix(~ -1+factor(c(1,1,1,2,2,3,3,3))) > colnames(design) <- c("group1", "group2", "group3") > fit <- lmFit(eset, design)

To make all pair-wise comparisons between the three groups the appropriate contrast matrix can be created by
> contrast.matrix <- makeContrasts(group2-group1, group3-group2, group3-group1, levels=design) > fit2 <- contrasts.fit(fit, contrast.matrix) > fit2 <- eBayes(fit2)

A list of top genes di?erential expressed in group2 versus group1 can be obtained from
> topTable(fit2, coef=1, adjust="BH")

The outcome of each hypothesis test can be assigned using 29

> results <- decideTests(fit2)

A Venn diagram showing numbers of genes signi?cant in each comparison can be obtained from
> vennDiagram(results)

7.3

Common Reference Designs

Now consider two-color microarray experiments in which a common reference has been used on all the arrays. Such experiments can be analysed very similarly to A?ymetrix experiments except that allowance must be made for dye-swaps. The simplest method is to setup the design matrix using the modelMatrix() function and the targets ?le. As an example, we consider part of an experiment conducted by Jo¨lle Michaud, Catherine Carmichael and Dr Hamish Scott at e the Walter and Eliza Hall Institute to compare the e?ects of transcription factors in a human cell line. The targets ?le is as follows:
> targets <- readTargets("runxtargets.txt") > targets SlideNumber Cy3 Cy5 1 2144 EGFP AML1 2 2145 EGFP AML1 3 2146 AML1 EGFP 4 2147 EGFP AML1.CBFb 5 2148 EGFP AML1.CBFb 6 2149 AML1.CBFb EGFP 7 2158 EGFP CBFb 8 2159 CBFb EGFP 9 2160 EGFP AML1.CBFb 10 2161 AML1.CBFb EGFP 11 2162 EGFP AML1.CBFb 12 2163 AML1.CBFb EGFP 13 2166 EGFP CBFb 14 2167 CBFb EGFP

In the experiment, green ?uorescent protein (EGFP) has been used as a common reference. An adenovirus system was used to transport various transcription factors into the nuclei of HeLa cells. Here we consider the transcription factors AML1, CBFbeta or both. A simple design matrix was formed and a linear model ?t:
> design <- modelMatrix(targets,ref="EGFP") > design AML1 AML1.CBFb CBFb 1 1 0 0 2 1 0 0 3 -1 0 0 4 0 1 0 5 0 1 0 6 0 -1 0

30

7 8 9 10 11 12 13 14 > fit

0 0 1 0 0 -1 0 1 0 0 -1 0 0 1 0 0 -1 0 0 0 1 0 0 -1 <- lmFit(MA, design)

It is of interest to compare each of the transcription factors to EGFP and also to compare the combination transcription factor with AML1 and CBFb individually. An appropriate contrast matrix was formed as follows:
> contrast.matrix <- makeContrasts(AML1,CBFb,AML1.CBFb,AML1.CBFb-AML1,AML1.CBFb-CBFb, + levels=design) > contrast.matrix AML1 CBFb AML1.CBFb AML1.CBFb - AML1 AML1.CBFb - CBFb AML1 1 0 0 -1 0 AML1.CBFb 0 0 1 1 1 CBFb 0 1 0 0 -1

The linear model ?t can now be expanded and empirical Bayes statistics computed:
> fit2 <- contrasts.fit(fit, contrasts.matrix) > fit2 <- eBayes(fit2)

7.4

Direct Two-Color Designs

Two-colour designs without a common reference require the most statistical knowledge to choose the appropriate design matrix. A direct design is one in which there is no single RNA source which is hybridized to every array. As an example, we consider an experiment conducted by Dr Mireille Lahoud at the Walter and Eliza Hall Institute to compare gene expression in three di?erent populations of dendritric cells (DC).

This experiment involved six cDNA microarrays in three dye-swap pairs, with each pair used to compare two DC types. The design is shown diagrammatically above. The targets ?le was as follows: 31

> targets SlideNumber ml12med 12 ml13med 13 ml14med 14 ml15med 15 ml16med 16 ml17med 17

FileName ml12med.spot ml13med.spot ml14med.spot ml15med.spot ml16med.spot ml17med.spot

Cy3 CD4 CD8 DN CD8 CD4 DN

Cy5 CD8 CD4 CD8 DN DN CD4

There are many valid choices for a design matrix for such an experiment and no single correct choice. We chose to setup the design matrix as follows:
> design <- modelMatrix(targets, ref="CD4") Found unique target names: CD4 CD8 DN > design CD8 DN ml12med 1 0 ml13med -1 0 ml14med 1 -1 ml15med -1 1 ml16med 0 1 ml17med 0 -1

In this design matrix, the CD8 and DN populations have been compared back to the CD4 population. The coe?cients estimated by the linear model will correspond to the log-ratios of CD8 vs CD4 (?rst column) and DN vs CD4 (second column). After appropriate normalization of the expression data, a linear model was ?t using
> fit <- lmFit(MA, design, ndups=2)

The use of ndups is to specify that the arrays contained duplicates of each gene, see Section 11.6. The linear model can now be interrogated to answer any questions of interest. For this experiment it was of interest to make all pairwise comparisons between the three DC populations. This was accomplished using the contrast matrix
> contrast.matrix <- cbind("CD8-CD4"=c(1,0),"DN-CD4"=c(0,1),"CD8-DN"=c(1,-1)) > rownames(contrast.matrix) <- colnames(design) > contrast.matrix CD8-CD4 DN-CD4 CD8-DN CD8 1 0 1 DN 0 1 -1

The contrast matrix can be used to expand the linear model ?t and then to compute empirical Bayes statistics:
> fit2 <- constrast.fit(fit, contrast.matrix) > fit2 <- eBayes(fit2)

32

Chapter 8 Speci?c Designs
8.1
8.1.1

Simple Comparisons
Replicate Arrays

The simplest possible microarray experiment is one with a series of replicate two-color arrays all comparing the same two RNA sources. For a three-array experiment comparing wild type (wt) and mutant (mu) RNA, the targets ?le might contain the following entries: FileName Cy3 Cy5 File1 wt mu File2 wt mu File3 wt mu A list of di?erentially expressed genes might be found for this experiment by
> fit <- lmFit(MA) > fit <- eBayes(fit) > topTable(fit)

where MA holds the normalized data. The default design matrix used here is just a single column of ones. The experiment here measures the fold change of mutant over wild type. Genes which have positive M-values are more highly expressed in the mutant RNA while genes with negative M-values are more highly expressed in the wild type. The analysis is analogous to the classical single-sample t-test except that we have used empirical Bayes methods to borrow information between genes.

8.1.2

Dye Swaps

A simple modi?cation of the above experiment would be to swap the dyes for one of the arrays. The targets ?le might now be FileName Cy3 Cy5 File1 wt mu File2 mu wt File3 wt mu 33

Now the analysis would be
> > > > design <- c(1,-1,1) fit <- lmFit(MA, design) fit <- eBayes(fit) topTable(fit)

Alternatively the design matrix could be set, replacing the ?rst of the above code lines, by
> design <- modelMatrix(targets, ref="wt")

where targets is the data frame holding the targets ?le information. If there are at least two arrays with each dye-orientation, then it is possible to estimate and adjust for any probe-speci?c dye e?ects. The dye-e?ect is estimated by an intercept term. If the experiment was FileName Cy3 Cy5 File1 wt mu File2 mu wt File3 wt mu File4 mu wt then we could set
> design <- cbind(DyeEffect=1,MUvsWT=c(1,-1,1,-1)) > fit <- lmFit(MA, design) > fit <- eBayes(fit)

The genes which show dye e?ects can be seen by
> topTable(fit, coef="DyeEffect")

The genes which are di?erentially expressed in the mutant are obtained by
> topTable(fit, coef="MUvsWT")

The fold changes and signi?cant tests in this list are corrected for dye-e?ects. Including the dye-e?ect in the model in this way uses up one degree of freedom which might otherwise be used to estimate the residual variability, but it is valuable if many genes show non-negligible dye-e?ects.

8.2

Technical Replication

In the previous sections we have assumed that all arrays are biological replicates. Now consider an experiment in which two wild-type and two mice from the same mutant strain are compared using two arrays for each pair of mice. The targets might be

34

FileName Cy3 Cy5 File1 wt1 mu1 File2 wt1 mu1 File3 wt2 mu2 File4 wt2 mu2 The ?rst and second and third and fourth arrays are technical replicates. It would not be correct to treat this experiment as comprising four replicate arrays because the technical replicate pairs are not independent, in fact they are likely to be positively correlated. One way to analyze these data is the following:
> > > > > biolrep <- c(1, 1, 2, 2) corfit <- duplicateCorrelation(MA, ndups = 1, block = biolrep) fit <- lmFit(MA, block = biolrep, cor = corfit$consensus) fit <- eBayes(fit) topTable(fit, adjust = "BH")

The vector biolrep indicates the two blocks corresponding to biological replicates. The value cofit$consensus estimates the average correlation within the blocks and should be positive. This analysis is analogous to mixed model analysis of variance [15, Chapter 18] except that information has been borrowed between genes. Information is borrowed by constraining the within-block correlations to be equal between genes and by using empirical Bayes methods to moderate the standard deviations between genes [22]. If the technical replicates were in dye-swap pairs as FileName File1 File2 File3 File4 then one might use
> > > > > design <- c(1, -1, 1, -1) corfit <- duplicateCorrelation(MA, design, ndups = 1, block = biolrep) fit <- lmFit(MA, design, block = biolrep, cor = corfit$consensus) fit <- eBayes(fit) topTable(fit, adjust = "BH")

Cy3 Cy5 wt1 mu1 mu1 wt1 wt2 mu2 mu2 wt2

In this case the correlation corfit$consensus should be negative because the technical replicates are dye-swaps and should vary in opposite directions. This method of handling technical replication using duplicateCorrelation() is somewhat limited. If for example one techical replicate was dye-swapped and the other not, FileName File1 File2 File3 File4 Cy3 Cy5 wt1 mu1 mu1 wt1 wt2 mu2 wt2 mu2 35

then there is no way to use duplicateCorrelation() because the technical replicate correlation will be negative for the ?rst pair but positive for the second. An alternative strategy is to include a coe?cient in the design matrix for each of the two biological blocks. This could be accomplished by de?ning
> design <- cbind(MU1vsWT1 = c(1,-1,0,0), MU2vsWT2 = c(0,0,1,1)) > fit <- lmFit(MA, design)

This will ?t a linear model with two coe?cients, one estimating the mutant vs wild-type comparison for the ?rst pair of mice and the other for the second pair of mice. What we want is the average of the two mutant vs wild-type comparisons, and this is extracted by the contrast (MU1vsWT1+MU2vsWT2)/2:
> + > > > cont.matrix <- makeContrasts(MUvsWT = (MU1vsWT1 + MU2vsWT2)/2, levels = design) fit2 <- contrasts.fit(fit, cont.matrix) fit2 <- eBayes(fit2) topTable(fit2, adjust = "BH")

The technique of including an e?ect for each biological replicate is well suited to situations with a lot of technical replication. Here is a larger example from a real experiment. Three mutant mice are to be compared with three wild-type mice. Eighteen two-color arrays were used with each mouse appearing on six di?erent arrays:
> targets FileName 1391.spot 1392.spot 1340.spot 1341.spot 1395.spot 1396.spot 1393.spot 1394.spot 1371.spot 1372.spot 1338.spot 1339.spot 1387.spot 1388.spot 1399.spot 1390.spot 1397.spot 1398.spot Cy3 wt1 mu1 wt2 mu1 wt3 mu1 wt1 mu2 wt2 mu2 wt3 mu2 wt1 mu3 wt2 mu3 wt3 mu3 Cy5 mu1 wt1 mu1 wt2 mu1 wt3 mu2 wt1 mu2 wt2 mu2 wt3 mu3 wt1 mu3 wt2 mu3 wt3

1391 1392 1340 1341 1395 1396 1393 1394 1371 1372 1338 1339 1387 1388 1399 1390 1397 1398

The comparison of interest is the average di?erence between the mutant and wild-type mice. duplicateCorrelation() could not be used here because the arrays do not group neatly into biological replicate groups. In any case, with six arrays on each mouse it is much safer and more conservative to ?t an e?ect for each mouse. We could proceed as 36

> design <- modelMatrix(targets, ref = "wt1") > design <- cbind(Dye = 1, design) > colnames(design) [1] "Dye" "mu1" "mu2" "mu3" "wt2" "wt3"

The above code treats the ?rst wild-type mouse as a baseline reference so that columns of the design matrix represent the di?erence between each of the other mice and wt1. The design matrix also includes an intercept term which represents the dye e?ect of Cy5 over Cy3 for each gene. If no dye e?ect is expected then the second line of code can be omitted.
> > + > > > fit <- lmFit(MA, design) cont.matrix <- makeContrasts(muvswt = (mu1+mu2+mu3-wt2-wt3)/3, levels = design) fit2 <- contrasts.fit(fit, cont.matrix) fit2 <- eBayes(fit2) topTable(fit2, adjust = "BH")

The contrast de?ned by the function makeContrasts represents the average di?erence between the mutant and wild-type mice, which is the comparison of interest. This general approach is applicable to many studies involving biological replicates. Here is another example based on a real example conducted by the Scott Lab at the Walter and Eliza Hall Institute (WEHI). RNA is collected from four human subjects from the same family, two a?ected by a leukemia-inducing mutation and two una?ected. Each of the two a?ected subjects (A1 and A2) is compared with each of the two una?ected subjects (U1 and U2): FileName Cy3 Cy5 File1 U1 A1 File2 A1 U2 File3 U2 A2 File4 A2 U1 Our interest is to ?nd genes which are di?erentially expressed between the a?ected and una?ected subjects. Although all four arrays compare an a?ected with an una?ected subject, the arrays are not independent. We need to take account of the fact that RNA from each subject appears on two di?erent arrays. We do this by ?tting a model with a coe?cient for each subject and then extracting the contrast between the a?ected and una?ected subjects:
> > > > > > design <- modelMatrix(targets, ref = "U1") fit <- lmFit(MA, design) cont.matrix <- makeContrasts(AvsU = (A1+A2-U2)/2, levels = design) fit2 <- contrasts.fit(fit, cont.matrix) fit2 <- eBayes(fit2) topTable(fit2, adjust = "BH")

37

8.3

Paired Samples

Paired samples occur when we compare two treatments and each sample given one treatment is naturally paired with a particular sample given the other treatment. This is a special case of blocking with blocks of size two. The classical test associated with this situation is the paired t-test. Suppose an experiment is conducted with A?ymetrix or single-channel arrays to compare a new treatment (T) with a control (C). Six dogs are used from three sib-ships. For each sib-pair, one dog is given the treatment while the other dog is a control. This produces the targets frame: FileName SibShip Treatment File1 1 C File2 1 T File3 2 C File4 2 T File5 3 C File6 3 T A moderated paired t-test can be computed by allowing for sib-pair e?ects in the linear model:
> > > > > > SibShip <- factor(targets$SibShip) Treat <- factor(targets$Treatment, levels=c("C","T")) design <- model.matrix(~SibShip+Treat) fit <- lmFit(eset, design) fit <- eBayes(fit) topTable(fit, coef="TreatT")

8.4

Two Groups: Common Reference

Suppose now that we wish to compare two wild type (Wt) mice with three mutant (Mu) mice using arrays hybridized with a common reference RNA (Ref): FileName Cy3 File1 Ref File2 Ref File3 Ref File4 Ref File5 Ref Cy5 WT WT Mu Mu Mu

The interest here is in the comparison between the mutant and wild type mice. There are two major ways in which this comparison can be made. We can 1. create a design matrix which includes a coe?cient for the mutant vs wild type di?erence, or

38

2. create a design matrix which includes separate coe?cients for wild type and mutant mice and then extract the di?erence as a contrast. For the ?rst approach, the design matrix should be as follows
> design WTvsREF MUvsWT 1 0 1 0 1 1 1 1 1 1

Array1 Array2 Array3 Array4 Array5

Here the ?rst coe?cient estimates the di?erence between wild type and the reference for each probe while the second coe?cient estimates the di?erence between mutant and wild type. For those not familiar with model matrices in linear regression, it can be understood in the following way. The matrix indicates which coe?cients apply to each array. For the ?rst two arrays the ?tted values will be just the WTvsREF coe?cient, which is correct. For the remaining arrays the ?tted values will be WTvsREF + MUvsWT, which is equivalent to mutant vs reference, also correct. For reasons that will be apparent later, this is sometimes called the treatment-contrasts parametrization. Di?erentially expressed genes can be found by
> fit <- lmFit(MA, design) > fit <- eBayes(fit) > topTable(fit, coef="MUvsWT", adjust="BH")

There is no need here to use contrasts.fit() because the comparison of interest is already built into the ?tted model. This analysis is analogous to the classical pooled two-sample t-test except that information has been borrowed between genes. For the second approach, the design matrix should be
Array1 Array2 Array3 Array4 Array5 WT MU 1 0 1 0 0 1 0 1 0 1

The ?rst coe?cient now represents wild-type vs the reference and the second represents mutant vs the reference. Our comparison of interest is the di?erence between these two coe?cients. We will call this the group-means parametrization. Di?erentially expressed genes can be found by
> > > > > fit <- lmFit(MA, design) cont.matrix <- makeContrasts(MUvsWT=WT-MU, levels=design) fit2 <- contrasts.fit(fit, cont.matrix) fit2 <- eBayes(fit2) topTable(fit2, adjust="BH")

39

The results will be exactly the same as for the ?rst approach. The design matrix can be constructed 1. manually, 2. using the limma function modelMatrix(), or 3. using the built-in R function model.matrix(). Let Group be the factor de?ned by
> Group <- factor(c("WT","WT","Mu","Mu","Mu"), levels=c("WT","Mu"))

For the ?rst approach, the treatment-contrasts parametrization, the design matrix can be computed by
> design <- cbind(WTvsRef=1,MUvsWT=c(0,0,1,1,1))

or by
> param <- cbind(WTvsRef=c(-1,1,0),MUvsWT=c(0,-1,1)) > rownames(param) <- c("Ref","WT","Mu") > design <- modelMatrix(targets, parameters=param)

or by
> design <- model.matrix(~Group) > colnames(design) <- c("WTvsRef","MUvsWT")

all of which produce the same result. For the second approach, the group-means parametrization, the design matrix can be computed by
> design <- cbind(WT=c(1,1,0,0,0),MU=c(0,0,1,1,1))

or by
> param <- cbind(WT=c(-1,1,0),MU=c(-1,0,1)) > rownames(param) <- c("Ref","WT","Mu") > design <- modelMatrix(targets, parameters=param)

or by
> design <- model.matrix(~0+Group) > colnames(design) <- c("WT","Mu")

all of which again produce the same result.

40

8.5

Two Groups: A?ymetrix

Suppose now that we wish to compare two wild type (Wt) mice with three mutant (Mu) mice using A?ymetrix arrays or any other single-channel array technology: FileName Target File1 WT File2 WT File3 Mu File4 Mu File5 Mu Everything is exactly as in the previous section, except that the function modelMatrix() would not be used. We can either 1. create a design matrix which includes a coe?cient for the mutant vs wild type di?erence, or 2. create a design matrix which includes separate coe?cients for wild type and mutant mice and then extract the di?erence as a contrast. For the ?rst approach, the treatment-contrasts parametrization, the design matrix should be as follows:
> design WT MUvsWT 1 0 1 0 1 1 1 1 1 1

Array1 Array2 Array3 Array4 Array5

Here the ?rst coe?cient estimates the mean log-expression for wild type mice and plays the role of an intercept. The second coe?cient estimates the di?erence between mutant and wild type. Di?erentially expressed genes can be found by
> fit <- lmFit(eset, design) > fit <- eBayes(fit) > topTable(fit, coef="MUvsWT", adjust="BH")

where eset is an exprSet or matrix object containing the log-expression values. For the second approach, the design matrix should be
Array1 Array2 Array3 Array4 Array5 WT MU 1 0 1 0 0 1 0 1 0 1

41

Di?erentially expressed genes can be found by
> > > > > fit <- lmFit(eset, design) cont.matrix <- makeContrasts(MUvsWT=WT-MU, levels=design) fit2 <- contrasts.fit(fit, cont.matrix) fit2 <- eBayes(fit2) topTable(fit2, adjust="BH")

For the ?rst approach, the treatment-contrasts parametrization, the design matrix can be computed by
> design <- cbind(WT=1,MUvsWT=c(0,0,1,1,1))

or by
> design <- model.matrix(~Group) > colnames(design) <- c("WT","MUvsWT")

For the second approach, the group-means parametrization, the design matrix can be computed by
> design <- cbind(WT=c(1,1,0,0,0,MU=c(0,0,1,1,1))

or by
> design <- model.matrix(~0+Group) > colnames(design) <- c("WT","MU")

8.6

Several Groups

The above approaches for two groups extend easily to any number of groups. Suppose that three RNA targets to be compared using A?ymetrixTM arrays. Suppose that the three targets are called “RNA1”, “RNA2” and “RNA3” and that the column targets$Target indicates which one was hybridized to each array. An appropriate design matrix can be created using
> f <- factor(targets$Target, levels=c("RNA1","RNA2","RNA3")) > design <- model.matrix(~0+f) > colnames(design) <- c("RNA1","RNA2","RNA3")

To make all pair-wise comparisons between the three groups one could proceed
> > + > > fit <- lmFit(eset, design) contrast.matrix <- makeContrasts(RNA2-RNA1, RNA3-RNA2, RNA3-RNA1, levels=design) fit2 <- contrasts.fit(fit, contrast.matrix) fit2 <- eBayes(fit2)

A list of top genes for RNA2 versus RNA1 can be obtained from
> topTable(fit2, coef=1, adjust="BH")

The outcome of each hypothesis test can be assigned using 42

> results <- decideTests(fit2)

A Venn diagram showing numbers of genes signi?cant in each comparison can be obtained from
> vennDiagram(results)

The statistic fit2$F and the corresponding fit2$F.p.value combine the three pair-wise comparisons into one F -test. This is equivalent to a one-way ANOVA for each gene except that the residual mean squares have been moderated between genes. To ?nd genes which vary between the three RNA targets in any way, look for genes with small p-values. To ?nd the top 30 genes:
> topTableF(fit2, number=30)

Now suppose that the experiment had been conducted using two-color arrays with a common reference instead of A?ymetrixTM arrays. For example the targets frame might be FileName File1 File2 File3 File4 File5 Cy3 Cy5 Ref RNA1 RNA1 Ref Ref RNA2 RNA2 Ref Ref RNA3

For this experiment the design matrix could be formed by
> design <- modelMatrix(targets, ref="Ref")

and everything else would be as for the A?ymetrixTM experiment.

8.7

Factorial Designs

Factorial designs are those where more than one experimental dimension is being varied and each combination of treatment conditions is observed. Suppose that cells are extracted from wild type and mutant mice and these cells are either stimulated (S) or unstimulated (U). RNA from the treated cells is then extracted and hybridized to a microarray. We will assume for simplicity that the arrays are single-color arrays such as A?ymetrix. Consider the following targets frame: FileName Strain Treatment File1 WT U File2 WT S File3 Mu U File4 Mu S File5 Mu S 43

The two experimental dimensions or factors here are Strain and Treatment. Strain speci?es the genotype of the mouse from which the cells are extracted and Treatment speci?es whether the cells are stimulated or not. All four combinations of Strain and Treatment are observed, so this is a factorial design. It will be convenient for us to collect the Strain/Treatment combinations into one vector as follows:
> TS <- paste(targets$Strain, targets$Treatment, sep=".") > TS [1] "WT.U" "WT.S" "Mu.U" "Mu.S" "Mu.S"

It is especially important with a factorial design to decide what are the comparisons of interest. We will assume here that the experimenter is interested in 1. which genes respond to stimulation in wild-type cells, 2. which genes respond to stimulation in mutant cells, and 3. which genes respond di?erently in mutant compared to wild-type cells. as these are the questions which are most usually relevant in a molecular biology context. The ?rst of these questions relates to the WT.S vs WT.U comparison and the second to Mu.S vs Mu.U. The third relates to the di?erence of di?erences, i.e., (Mu.S-Mu.U)-(WT.S-WT.U), which is called the interaction term. We describe ?rst a simple way to analyse this experiment using limma commands in a similar way to that in which two-sample designs were analyzed. Then we will go on to describe the more classical statistical approaches using factorial model formulas. All the approaches considered are equivalent and yield identical bottom-line results. The most basic approach is to ?t a model with a coe?cient for each of the four factor combinations and then to extract the comparisons of interest as contrasts:
> > > > TS <- factor(TS, levels=c("WT.U","WT.S","Mu.U","Mu.S")) design <- model.matrix(~0+TS) colnames(design) <- levels(TS) fit <- lmFit(eset, design)

This ?ts a model with four coe?cients corresponding to WT.U, WT.S, Mu.U and Mu.S respectively. Our three contrasts of interest can be extracted by
> cont.matrix <- makeContrasts( + WT.SvsU=WT.S-WT.U, + Mu.SvsU=Mu.S-Mu.U, + Diff=(Mu.S-Mu.U)-(WT.S-WT.U), + levels=design) > fit2 <- contrasts.fit(fit, cont.matrix) > fit2 <- eBayes(fit2)

We can use topTable() to look at lists of di?erentially expressed genes for each of three contrasts, or else 44

> results <- decideTests(fit2) > vennDiagram(results)

to look at all three contrasts simultaneously. The analysis of factorial designs has a long history in statistics and a system of factorial model formulas has been developed to facilitate the analysis of complex designs. It is important to understand though that the above three molecular biology questions do not correspond to any of the usual parametrizations used in statistics for factorial designs. Suppose for example that we proceed in the usual statistical way,
> Strain <- factor(targets$Strain, levels=c("WT","Mu")) > Treatment <- factor(targets$Treatment, levels=c("U","S")) > design <- model.matrix(~Strain*Treatment)

This creates a design matrix which de?nes four coe?cients with the following interpretations:
Coe?cient Intercept StrainMu TreatmentS StrainMu:TreatmentS Comparison WT.U Mu.U-WT.U WT.S-WT.U (Mu.S-Mu.U)-(WT.S-WT.U) Interpretation Baseline level of unstimulated WT Di?erence between unstimulated strains Stimulation e?ect for WT Interaction

This is called the treatment-contrast parametrization. Notice that one of our comparisons of interest, Mu.S-Mu.U, is not represented and instead the comparison Mu.U-WT.U, which might not be of direct interest, is included. We need to use contrasts to extract all the comparisons of interest:
> > > > fit <- lmFit(eset, design) cont.matrix <- cbind(WT.SvsU=c(0,0,1,0),Mu.SvsU=c(0,0,1,1),Diff=c(0,0,0,1)) fit2 <- contrasts.fit(fit, cont.matrix) fit2 <- eBayes(fit2)

This extracts the WT stimulation e?ect as the third coe?cient and the interaction as the fourth coe?cient. The mutant stimulation e?ect is extracted as the sum of the third and fourth coe?cients of the original model. This analysis yields exactly the same results as the previous analysis. An even more classical statistical approach to the factorial experiment would be to use the sum to zero parametrization. In R this is achieved by
> contrasts(Strain) <- contr.sum(2) > contrasts(Treatment) <- contr.sum(2) > design <- model.matrix(~Strain*Treatment)

This de?nes four coe?cients with the following interpretations:
Coe?cient Intercept Strain1 Treatment1 Strain1:Treatment1 Comparison (WT.U+WT.S+Mu.U+Mu.S)/4 (WT.U+WT.S-Mu.U-Mu.S)/4 (WT.U-WT.S+Mu.U-Mu.S)/4 (WT.U-WT.S-Mu.U+Mu.S)/4 Interpretation Grand mean Strain main e?ect Treatment main e?ect Interaction

45

This parametrization has many appealing mathematical properties and is the classical parametrization used for factorial designs in much experimental design theory. However it de?nes only one coe?cient which is directly of interest to us, namely the interaction. Our three contrasts of interest could be extracted using
> > > > fit <- lmFit(eset, design) cont.matrix <- cbind(WT.SvsU=c(0,0,-2,-2),Mu.SvsU=c(0,0,-2,2),Diff=c(0,0,0,4)) fit2 <- contrasts.fit(fit, cont.matrix) fit2 <- eBayes(fit2)

The results will be identical to those for the previous two approaches. The three approaches described here for the 2 × 2 factorial problem are equivalent and di?er only in the parametrization chosen for the linear model. The three ?tted model objects fit will di?er only in the coefficients and associated components. The residual standard deviations fit$sigma, residual degrees of freedom fit$df.residual and all components of fit2 will be identical for the three approaches. Since the three approaches are equivalent, users are free to choose whichever one is most convenient or intuitive.

8.8

Time Course Experiments

Time course experiments are those in which RNA is extracted at several time points after the onset of some treatment or stimulation. Simple time course experiments are similar to experiments with several groups covered in Section 8.6. Here we consider a two-way experiment in which time course pro?les are to be compared for two genotypes. Consider the targets frame FileName File1 File2 File3 File4 File5 File6 File7 File8 Target wt.0hr wt.0hr wt.6hr wt.24hr mu.0hr mu.0hr mu.6hr mu.24hr

The targets are RNA samples collected from wild-type and mutant animals at 0, 6 and 24 hour time points. This can be viewed as a factorial experiment but a simpler approach is to use the group-mean parametrization.
> > > > > lev <- c("wt.0hr","wt.6hr","wt.24hr","mu.0hr","mu.6hr","mu.24hr") f <- factor(targets$Target, levels=lev) design <- model.matrix(~0+f) colnames(design) <- lev fit <- lmFit(eset, design)

Which genes respond at either the 6 hour or 24 hour times in the wild-type? We can ?nd these by extracting the contrasts between the wild-type times. 46

> + + + > > >

cont.wt <- makeContrasts( "wt.6hr-wt.0hr", "wt.24hr-wt.6hr", levels=design) fit2 <- contrasts.fit(fit, cont.wt) fit2 <- eBayes(fit2) topTableF(fit2, adjust="BH")

Any two contrasts between the three times would give the same result. The same gene list would be obtained had "wt.24hr-wt.0hr" been used in place of "wt.24hr-wt.6hr" for example. Which genes respond in the mutant?
> + + + > > > cont.mu <- makeContrasts( "mu.6hr-mu.0hr", "mu.24hr-wt.6hr", levels=design) fit2 <- contrasts.fit(fit, cont.mu) fit2 <- eBayes(fit2) topTableF(fit2, adjust="BH")

Which genes respond di?erently in the mutant relative to the wild-type?
> + + + > > > cont.dif <- makeContrasts( Dif6hr =(mu.6hr-mu.0hr)-(wt.6hr-wt.0hr), Dif24hr=(mu.24hr-mu.6hr)-(wt.24hr-wt.6hr), levels=design) fit2 <- contrasts.fit(fit, cont.dif) fit2 <- eBayes(fit2) topTableF(fit2, adjust="BH")

The method of analysis described in this section was used for a six-point time course experiment on histone deacetylase inhibitors [16].

47

Chapter 9 Separate Channel Analysis of Two-Color Data
Consider an experiment comparing young and old animals for both both wild-type and mutant genotypes. FileName File1 File2 File3 File4 Cy3 Cy5 wt.young wt.old wt.old wt.young mu.young mu.old mu.old mu.young

Each of the arrays in this experiment makes a direct comparison between young and old RNA targets. There are no arrays which compare wild-type and mutant animals. This is an example of an unconnected design in that there are no arrays linking the wild-type and mutant targets. It is not possible to make comparisons between wild-type and mutant animals on the basis of log-ratios alone. So to do this it is necessary to analyse the red and green channels intensities separately, i.e., to analyze log-intensities instead of log-ratios. It is possible to do this using a mixed model representation which treats each spot as a randomized block [30, 21]. Limma implements mixed model methods for separate channel analysis which make use of shrinkage methods to ensure stable and reliable inference with small numbers of arrays [21]. Limma also provides between-array normalization to prepare for separate channel analysis, for example
> MA <- normalizeBetweenArrays(MA, method="Aquantile")

scales the intensities so that A-values have the same distribution across arrays. The ?rst step in the di?erential expression analysis is to convert the targets frame to be channel rather than array orientated.
> targets2 <- targetsA2C(targets) > targets2 channel.col FileName Target 1 File1 wt.young

File1.1

48

File1.2 File2.1 File2.2 File3.1 File3.2 File4.1 File4.2

2 1 2 1 2 1 2

File1 wt.old File2 wt.old File2 wt.young File3 mu.young File3 mu.old File4 mu.old File4 mu.young

The following code produces a design matrix with eight rows and four columns:
> > > > u <- unique(targets2$Target) f <- factor(targets2$Target, levels=u) design <- model.matrix(~0+f) colnames(design) <- u

Inference proceeds as for within-array replicate spots except that the correlation to be estimated is that between the two channels for the same spot rather than between replicate spots.
> corfit <- intraspotCorrelation(MA, design) > fit <- lmscFit(MA, design, correlation=corfit$consensus)

Subsequent steps proceed as for log-ratio analyses. For example if we want to campare wildtype young to mutant young animals, we could extract this contrast by
> > > > cont.matrix <- makeContrasts("mu.young-wt.young",levels=design) fit2 <- contrasts.fit(fit, cont.matrix) fit2 <- eBayes(fit2) topTable(fit2, adjust="BH")

49

Chapter 10 Statistics for Di?erential Expression
10.1 Summary Top-Tables

Limma provides functions topTable() and decideTests() which summarize the results of the linear model, perform hypothesis tests and adjust the p-values for multiple testing. Results include (log) fold changes, standard errors, t-statistics and p-values. The basic statistic used for signi?cance analysis is the moderated t-statistic, which is computed for each probe and for each contrast. This has the same interpretation as an ordinary t-statistic except that the standard errors have been moderated across genes, i.e., shrunk towards a common value, using a simple Bayesian model. This has the e?ect of borrowing information from the ensemble of genes to aid with inference about each individual gene [25]. Moderated t-statistics lead to p-values in the same way that ordinary t-statistics do except that the degrees of freedom are increased, re?ecting the greater reliability associated with the smoothed standard errors. The e?ectiveness of the moderated t approach has been demontrated on test data sets for which the di?erential expression status of each probe is known [10]. A number of summary statistics are presented by topTable() for the top genes and the selected contrast. The M -value (M) is the value of the contrast. Usually this represents a log2 fold change between two or more experimental conditions although sometimes it represents a log2 -expression level. The A-value (A) is the average log2 -expression level for that gene across all the arrays and channels in the experiment. Column t is the moderated t-statistic. Column P.Value is the associated p-value and adj.P.Value is the p-value adjusted for multiple testing. The most popular form of adjustment is "BH" which is Benjamini and Hochberg’s method to control the false discovery rate [1]. The adjusted values are often called q-values if the intention is to control or estimate the false discovery rate. The meaning of "BH" q-values is as follows. If all genes with q-value below a threshold, say 0.05, are selected as di?erentially expressed, then the expected proportion of false discoveries in the selected group is controled to be less than the theshold value, in this case 5%. This procedure is equivalent to the procedure of Benjamini and Hochberg although the original paper did not formulate the method in terms of adjusted p-values. The B-statistic (lods or B) is the log-odds that the gene is di?erentially expressed [25, Section 5]. Suppose for example that B = 1.5. The odds of di?erential expression is 50

exp(1.5)=4.48, i.e, about four and a half to one. The probability that the gene is di?erentially expressed is 4.48/(1+4.48)=0.82, i.e., the probability is about 82% that this gene is di?erentially expressed. A B-statistic of zero corresponds to a 50-50 chance that the gene is di?erentially expressed. The B-statistic is automatically adjusted for multiple testing by assuming that 1% of the genes, or some other percentage speci?ed by the user in the call to eBayes(), are expected to be di?erentially expressed. The p-values and B-statistics will normally rank genes in the same order. In fact, if the data contains no missing values or quality weights, then the order will be precisely the same. As with all model-based methods, the p-values depend on normality and other mathematical assumptions which are never exactly true for microarray data. It has been argued that the p-values are useful for ranking genes even in the presence of large deviations from the assumptions [24, 22]. Benjamini and Hochberg’s control of the false discovery rate assumes independence between genes, although Reiner et al [17] have argued that it works for many forms of dependence as well. The B-statistic probabilities depend on the same assumptions but require in addition a prior guess for the proportion of di?erentially expressed genes. The p-values may be preferred to the B-statistics because they do not require this prior knowledge. The eBayes() function computes one more useful statistic. The moderated F -statistic (F) combines the t-statistics for all the contrasts into an overall test of signi?cance for that gene. The F -statistic tests whether any of the contrasts are non-zero for that gene, i.e., whether that gene is di?erentially expressed on any contrast. The denominator degrees of freedom is the same as that of the moderated-t. Its p-value is stored as fit$F.p.value. It is similar to the ordinary F -statistic from analysis of variance except that the denominator mean squares are moderated across genes. A commonly asked question relates to the fact that it can be all of the adjusted p-values are equal to 1. This is not an error situation but rather an indication that there is no overall evidence of di?erential in the data after adjusting for multiple testing. This can occur even though many of the raw p-values may seem highly signi?cant when taken as individual values. This situation occurs when none of the raw p-values are less than 1/G, where G is the number of probes included in the ?t. This is true for any of the adjustment methods except for adjust="none".

10.2

Fitted Model Objects

The output from lmFit() is an object of class MArrayLM. This section gives some mathematical details describing what is contained in such objects. This section can be skipped by readers not interested in such details. 2 The linear model for gene j has residual variance σj with sample value s2 and degrees j of freedom dj . The output from lmFit(), fit say, holds the sj in component fit$sigma and 2 ? the dj in fit$df.residual. The covariance matrix of the estimated β j is σj C T (X T Vj X)?1 C where Vj is a weight matrix determined by prior weights, any covariance terms introduced by correlation structure and any iterative weights introduced by robust estimation. The squareroots of the diagonal elements of C T (X T Vj X)?1 C are called unscaled standard deviations and

51

are stored in fit$stdev.unscaled. The ordinary t-statistic for the kth contrast for gene j is ? tjk = βjk /(ujk sj ) where ujk is the unscaled standard deviation. The ordinary t-statistics can be recovered by
> tstat.ord <- fit$coef/fit$stdev.unscaled/fit$sigma

after ?tting a linear model if desired. 2 The empirical Bayes method assumes an inverse Chisquare prior for the σj with mean s2 0 and degrees of freedom d0 . The posterior values for the residual variances are given by s2 = ?j d0 s2 + dj s2 0 j d0 + dj

where dj is the residual degrees of freedom for the jth gene. The output from eBayes() contains s2 and d0 as fit$s2.prior and fit$df.prior and the s2 as fit$s2.post. The moderated t?j 0 statistic is ? βjk ? tjk = ujk sj ? This can be shown to follow a t-distribution on d0 + dj degrees of freedom if βjk = 0 [25]. The extra degrees of freedom f0 represent the extra information which is borrowed from the ensemble of genes for inference about each individual gene. The output from eBayes() contains ? the tjk as fit$t with corresponding p-values in fit$p.value.

52

Chapter 11 Case Studies
11.1 Swirl Zebra?sh: A Single-Sample Experiment

In this section we consider a case study in which two RNA sources are compared directly on a set of replicate or dye-swap arrays. The case study includes reading in the data, data display and exploration, as well as normalization and di?erential expression analysis. The analysis of di?erential expression is analogous to a classical one-sample test of location for each gene. In this example we assume that the data is provided as a GAL ?le called fish.gal and raw SPOT output ?les and that these ?les are in the current working directory. The data used for this case study can be downloaded from http://bioinf.wehi.edu.au/limmaGUI/ DataSets.html.
> dir() [1] "fish.gal" "swirl.1.spot" [6] "SwirlSample.txt" "swirl.2.spot" "swirl.3.spot" "swirl.4.spot"

Background. The experiment was carried out using zebra?sh as a model organism to study the early development in vertebrates. Swirl is a point mutant in the BMP2 gene that a?ects the dorsal/ventral body axis. The main goal of the Swirl experiment is to identify genes with altered expression in the Swirl mutant compared to wild-type zebra?sh. The hybridizations. Two sets of dye-swap experiments were performed making a total of four replicate hybridizations. Each of the arrays compares RNA from swirl ?sh with RNA from normal (“wild type”) ?sh. The experimenters have prepared a tab-delimited targets ?le called SwirlSamples.txt which describes the four hybridizations:
> library(limma) > targets <- readTargets("SwirlSample.txt") > targets SlideNumber FileName Cy3 Cy5 swirl.1 81 swirl.1.spot swirl wild type swirl.2 82 swirl.2.spot wild type swirl swirl.3 93 swirl.3.spot swirl wild type swirl.4 94 swirl.4.spot wild type swirl

Date 2001/9/20 2001/9/20 2001/11/8 2001/11/8

53

We see that slide numbers 81, 82, 93 and 94 were used to make the arrays. On slides 81 and 93, swirl RNA was labelled with green (Cy3) dye and wild type RNA was labelled with red (Cy5) dye. On slides 82 and 94, the labelling was the other way around. Each of the four hybridized arrays was scanned on an Axon scanner to produce a TIFF image, which was then processed using the image analysis software SPOT. The data from the arrays are stored in the four output ?les listed under FileName. Now we read the intensity data into an RGList object in R. The default for SPOT output is that Rmean and Gmean are used as foreground intensities and morphR and morphG are used as background intensities:
> RG <- read.maimages(targets$FileName, source="spot") Read swirl.1.spot Read swirl.2.spot Read swirl.3.spot Read swirl.4.spot > RG An object of class "RGList" $R swirl.1 swirl.2 swirl.3 swirl.4 [1,] 19538.470 16138.720 2895.1600 14054.5400 [2,] 23619.820 17247.670 2976.6230 20112.2600 [3,] 21579.950 17317.150 2735.6190 12945.8500 [4,] 8905.143 6794.381 318.9524 524.0476 [5,] 8676.095 6043.542 780.6667 304.6190 8443 more rows ... $G [1,] [2,] [3,] [4,] [5,] 8443 $Rb swirl.1 swirl.2 swirl.3 swirl.4 [1,] 174 136 82 48 [2,] 174 133 82 48 [3,] 174 133 76 48 [4,] 163 105 61 48 [5,] 140 105 61 49 8443 more rows ... $Gb swirl.1 swirl.2 swirl.3 swirl.4 [1,] 182 175 86 97 [2,] 171 183 86 85 [3,] 153 183 86 85 [4,] 153 142 71 87 [5,] 153 142 71 87 8443 more rows ... swirl.1 22028.260 25613.200 22652.390 8929.286 8746.476 more rows swirl.2 swirl.3 swirl.4 19278.770 2727.5600 19930.6500 21438.960 2787.0330 25426.5800 20386.470 2419.8810 16225.9500 6677.619 383.2381 786.9048 6576.292 901.0000 468.0476 ...

54

The arrays. The microarrays used in this experiment were printed with 8448 probes (spots), including 768 control spots. The array printer uses a print head with a 4x4 arrangement of print-tips and so the microarrays are partitioned into a 4x4 grid of tip groups. Each grid consists of 22x24 spots that were printed with a single print-tip. The gene name associated with each spot is recorded in a GenePix array list (GAL) ?le:
> RG$genes <- readGAL("fish.gal") > RG$genes[1:30,] Block Row Column ID Name 1 1 1 control geno1 1 1 2 control geno2 1 1 3 control geno3 1 1 4 control 3XSSC 1 1 5 control 3XSSC 1 1 6 control EST1 1 1 7 control geno1 1 1 8 control geno2 1 1 9 control geno3 1 1 10 control 3XSSC 1 1 11 control 3XSSC 1 1 12 control 3XSSC 1 1 13 control EST2 1 1 14 control EST3 1 1 15 control EST4 1 1 16 control 3XSSC 1 1 17 control Actin 1 1 18 control Actin 1 1 19 control 3XSSC 1 1 20 control 3XSSC 1 1 21 control 3XSSC 1 1 22 control 3XSSC 1 1 23 control Actin 1 1 24 control Actin 1 2 1 control ath1 1 2 2 control Cad-1 1 2 3 control DeltaB 1 2 4 control Dlx4 1 2 5 control ephrinA4 1 2 6 control FGF8

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

Because we are using SPOT output, the 4x4x22x24 print layout also needs to be set. The easiest way to do this is to infer it from the GAL ?le:
> RG$printer <- getLayout(RG$genes)

Image plots. It is interesting to look at the variation of background values over the array. Consider image plots of the red and green background for the ?rst array:
> imageplot(log2(RG$Rb[,1]), RG$printer, low="white", high="red") > imageplot(log2(RG$Gb[,1]), RG$printer, low="white", high="green")

55

Image plot of the un-normalized log-ratios or M-values for the ?rst array:
> MA <- normalizeWithinArrays(RG, method="none") > imageplot(MA$M[,1], RG$printer, zlim=c(-3,3))

56

The imageplot function lies the slide on its side, so the ?rst print-tip group is bottom left in this plot. We can see a red streak across the middle two grids of the 3rd row caused by a scratch or dust on the array. Spots which are a?ected by this artefact will have suspect M-values. The streak also shows up as darker regions in the background plots. MA-plots. An MA-plot plots the log-ratio of R vs G against the overall intensity of each spot. The log-ratio is represented by the M-value, M = log2 (R) ? log2 (G), and the overall intensity by the A-value, A = (log2 (R)+log2 (G))/2. Here is the MA-plot of the un-normalized values for the ?rst array:
> plotMA(MA)

The red streak seen on the image plot can be seen as a line of spots in the upper right of this plot. Now we plot the individual MA-plots for each of the print-tip groups on this array, together with the loess curves which will be used for normalization: 57

> plotPrintTipLoess(MA)

Normalization. Print-tip loess normalization:
> MA <- normalizeWithinArrays(RG) > plotPrintTipLoess(MA)

We have normalized the M-values with each array. A further question is whether normalization is required between the arrays. The following plot shows overall boxplots of the M-values for the four arrays.
> boxplot(MA$M~col(MA$M),names=colnames(MA$M))

58

There is evidence that the di?erent arrays have di?erent spreads of M-values, so we will scale normalize between the arrays. (Note this is not done routinely for all two-color data sets. It should only be done when there is good evidence of a scale di?erence.)
> MA <- normalizeBetweenArrays(MA,method="scale") > boxplot(MA$M~col(MA$M),names=colnames(MA$M))

Linear model. Now estimate the average M-value for each gene. We do this by ?tting a simple linear model for each gene. The negative numbers in the design matrix indicate the dye-swaps. 59

> design <- c(-1,1,-1,1) > fit <- lmFit(MA,design) > fit An object of class "MArrayLM" $coefficients [1] -0.3943421 -0.3656843 -0.3912506 -0.2505729 -0.3432590 8443 more rows ... $rank [1] 1 $assign NULL $qr $qr [,1] [1,] 2.0 [2,] -0.5 [3,] 0.5 [4,] -0.5 $qraux [1] 1.5 $pivot [1] 1 $tol [1] 1e-07 $rank [1] 1 attr(,"class") [1] "qr" $df.residual [1] 3 3 3 3 3 8443 more elements ... $sigma [1] 0.3805154 0.4047829 0.4672451 0.3206071 0.2838043 8443 more elements ... $cov.coefficients [,1] [1,] 0.25 $stdev.unscaled [1] 0.5 0.5 0.5 0.5 0.5 8443 more rows ...

60

$pivot [1] 1 $method [1] "ls" $design [,1] [1,] -1 [2,] 1 [3,] -1 [4,] 1 $genes Block Row Column 1 1 1 1 2 1 1 2 3 1 1 3 4 1 1 4 5 1 1 5 8443 more rows ...

ID control control control control control

Name geno1 geno2 geno3 3XSSC 3XSSC

$Amean [1] 13.51784 13.72765 13.48607 10.89367 10.97615 8443 more elements ...

In the above ?t object, coefficients is the average M-value for each gene and sigma is the sample standard deviations for each gene. Ordinary t-statistics for comparing mutant to wt could be computed by
> ordinary.t <- fit$coef / fit$stdev.unscaled / fit$sigma

We prefer though to use empirical Bayes moderated t-statistics which are computed below. Now create an MA-plot of the average M and A-values for each gene.
> plotMA(fit) > abline(0,0,col="blue")

61

Empirical Bayes analysis. We will now go on and compute empirical Bayes statistics for di?erential expression. The moderated t-statistics use sample standard deviations which have been shrunk towards a pooled standard deviation value.
> fit <- eBayes(fit) > qqt(fit$t,df=fit$df.prior+fit$df.residual,pch=16,cex=0.2) > abline(0,1)

Visually there seems to be plenty of genes which are di?erentially expressed. We will obtain a summary table of some key statistics for the top genes. 62

> options(digits=3) > topTable(fit,number=30,adjust="BH") Block Row Column ID Name M 3721 8 2 1 control BMP2 -2.21 1609 4 2 1 control BMP2 -2.30 3723 8 2 3 control Dlx3 -2.18 1611 4 2 3 control Dlx3 -2.18 8295 16 16 15 fb94h06 20-L12 1.27 7036 14 8 4 fb40h07 7-D14 1.35 515 1 22 11 fc22a09 27-E17 1.27 5075 10 14 11 fb85f09 18-G18 1.28 7307 14 19 11 fc10h09 24-H18 1.20 319 1 14 7 fb85a01 18-E1 -1.29 2961 6 14 9 fb85d05 18-F10 -2.69 4032 8 14 24 fb87d12 18-N24 1.27 6903 14 2 15 control Vox -1.26 4546 9 14 10 fb85e07 18-G13 1.23 683 2 7 11 fb37b09 6-E18 1.31 1697 4 5 17 fb26b10 3-I20 1.09 7491 15 5 3 fb24g06 3-D11 1.33 4188 8 21 12 fc18d12 26-F24 -1.25 4380 9 7 12 fb37e11 6-G21 1.23 3726 8 2 6 control fli-1 -1.32 2679 6 2 15 control Vox -1.25 2151 5 2 15 control vent -1.40 7602 15 9 18 fb50g12 9-L23 1.16 5931 12 6 3 fb32f06 5-C12 -1.10 3790 8 4 22 fb23d08 2-N16 1.16 7542 15 7 6 fb36g12 6-D23 1.12 4263 9 2 15 control vent -1.41 6375 13 2 15 control vent -1.37 157 1 7 13 fb38a01 6-I1 -1.82 1146 3 4 18 fb22a12 2-I23 1.05

A 12.1 13.1 13.3 13.5 12.0 13.8 13.2 14.4 13.4 12.5 10.3 14.2 13.4 14.2 13.3 13.3 13.6 12.1 14.0 10.3 13.4 12.7 14.0 13.0 12.5 13.5 12.7 12.5 10.8 13.7

t -21.0 -20.3 -20.0 -19.6 14.0 13.5 13.4 13.3 13.2 -13.0 -13.0 12.8 -12.8 12.7 12.4 12.3 12.3 -12.2 12.0 -11.9 -11.8 -11.7 11.7 -11.7 11.6 11.0 -10.8 -10.5 -10.2 10.2

P.Value 0.000333 0.000333 0.000333 0.000333 0.001993 0.001993 0.001993 0.001993 0.001993 0.001993 0.001993 0.001993 0.001993 0.001993 0.002087 0.002087 0.002087 0.002122 0.002133 0.002133 0.002133 0.002133 0.002133 0.002133 0.002144 0.002900 0.003160 0.003829 0.004124 0.004124

B 8.02 7.84 7.77 7.68 5.81 5.57 5.52 5.51 5.43 5.36 5.35 5.25 5.23 5.21 5.06 4.99 4.99 4.93 4.84 4.80 4.75 4.67 4.66 4.66 4.61 4.29 4.18 3.96 3.81 3.78

The top gene is BMP2 which is signi?cantly down-regulated in the Swirl zebra?sh, as it should be because the Swirl ?sh are mutant in this gene. Other positive controls also appear in the top 30 genes in terms. In the table, t is the empirical Bayes moderated t-statistic, the corresponding P-values have been adjusted to control the false discovery rate and B is the empirical Bayes log odds of di?erential expression.
> > > > plotMA(fit) ord <- order(fit$lods,decreasing=TRUE) top30 <- ord[1:30] text(fit$Amean[top30],fit$coef[top30],labels=fit$genes[top30,"Name"],cex=0.8,col="blue")

63

11.2

ApoAI Knockout Data: A Two-Sample Experiment

In this section we consider a case study where two RNA sources are compared through a common reference RNA. The analysis of the log-ratios involves a two-sample comparison of means for each gene. In this example we assume that the data is available as an RGList in the data ?le ApoAI.RData. The data used for this case study can be downloaded from http://bioinf. wehi.edu.au/limmaGUI/DataSets.html. Background. The data is from a study of lipid metabolism by [4]. The apolipoprotein AI (ApoAI) gene is known to play a pivotal role in high density lipoprotein (HDL) metabolism. Mice which have the ApoAI gene knocked out have very low HDL cholesterol levels. The purpose of this experiment is to determine how ApoAI de?ciency a?ects the action of other genes in the liver, with the idea that this will help determine the molecular pathways through which ApoAI operates. Hybridizations. The experiment compared 8 ApoAI knockout mice with 8 normal C57BL/6 (”black six”) mice, the control mice. For each of these 16 mice, target mRNA was obtained from liver tissue and labelled using a Cy5 dye. The RNA from each mouse was hybridized to a separate microarray. Common reference RNA was labelled with Cy3 dye and used for all the arrays. The reference RNA was obtained by pooling RNA extracted from the 8 control mice. Number of arrays 8 8 Red Normal “black six” mice ApoAI knockout Green Pooled reference Pooled reference

64

This is an example of a single comparison experiment using a common reference. The fact that the comparison is made by way of a common reference rather than directly as for the swirl experiment makes this, for each gene, a two-sample rather than a single-sample setup.
> load("ApoAI.RData") > objects() [1] "RG" > names(RG) [1] "R" "G" "Rb" "Gb" "printer" "genes" "targets" > RG$targets FileName Cy3 Cy5 c1 a1koc1.spot Pool C57BL/6 c2 a1koc2.spot Pool C57BL/6 c3 a1koc3.spot Pool C57BL/6 c4 a1koc4.spot Pool C57BL/6 c5 a1koc5.spot Pool C57BL/6 c6 a1koc6.spot Pool C57BL/6 c7 a1koc7.spot Pool C57BL/6 c8 a1koc8.spot Pool C57BL/6 k1 a1kok1.spot Pool ApoAI-/k2 a1kok2.spot Pool ApoAI-/k3 a1kok3.spot Pool ApoAI-/k4 a1kok4.spot Pool ApoAI-/k5 a1kok5.spot Pool ApoAI-/k6 a1kok6.spot Pool ApoAI-/k7 a1kok7.spot Pool ApoAI-/k8 a1kok8.spot Pool ApoAI-/> MA <- normalizeWithinArrays(RG) > cols <- MA$targets$Cy5 > cols[cols=="C57BL/6"] <- "blue" > cols[cols=="ApoAI-/-"] <- "yellow" > boxplot(MA$M~col(MA$M),names=rownames(MA$targets),col=cols,xlab="Mouse",ylab="M-values")

Since the common reference here is a pool of the control mice, we expect to see more di?erences from the pool for the knock-out mice than for the control mice. In terms of the above plot, 65

this should translate into a wider range of M-values for the knock-out mice arrays than for the control arrays, and we do see this. Since the di?erent arrays are not expected to have the same range of M-values, between-array scale normalization of the M-values is not appropriate here. Now we can go on to estimate the fold change between the two groups. In this case the design matrix has two columns. The coe?cient for the second column estimates the parameter of interest, the log-ratio between knockout and control mice.
> design <- cbind("Control-Ref"=1,"KO-Control"=MA$targets$Cy5=="ApoAI-/-") > design Control-Ref KO-Control [1,] 1 0 [2,] 1 0 [3,] 1 0 [4,] 1 0 [5,] 1 0 [6,] 1 0 [7,] 1 0 [8,] 1 0 [9,] 1 1 [10,] 1 1 [11,] 1 1 [12,] 1 1 [13,] 1 1 [14,] 1 1 [15,] 1 1 [16,] 1 1 > fit <- lmFit(MA, design) > fit$coef[1:5,] Control-Ref KO-Control [1,] -0.6595 0.6393 [2,] 0.2294 0.6552 [3,] -0.2518 0.3342 [4,] -0.0517 0.0405 [5,] -0.2501 0.2230 > fit <- eBayes(fit) > options(digits=3)

Normally at this point one would just type
> topTable(fit,coef=2)

However, the gene annotation is a bit wide for the printed page, so we will tell codetopTable() to show just one column of the annotation information:
> topTable(fit,coef=2,number=15,genelist=fit$genes$NAME) ProbeID M A t P.Value B 2149 ApoAI,lipid-Img -3.166 12.47 -23.98 2.98e-11 14.939 540 EST,HighlysimilartoA -3.049 12.28 -12.97 4.93e-07 10.825 5356 CATECHOLO-METHYLTRAN -1.848 12.93 -12.44 6.42e-07 10.457 4139 EST,WeaklysimilartoC -1.027 12.61 -11.75 1.21e-06 9.928

66

1739 2537 1496 4941 947 5604 4140 6073 1337 954 563

ApoCIII,lipid-Img ESTs,Highlysimilarto est similartoyeaststerol EST,WeaklysimilartoF APXL2,5q-Img estrogenrec psoriasis-associated Caspase7,heart-Img FATTYACID-BINDINGPRO

-0.933 -1.010 -0.977 -0.955 -0.571 -0.366 -0.420 0.421 -0.838 -0.302 -0.637

13.74 13.63 12.23 13.29 10.54 12.71 9.79 9.79 11.66 12.14 11.62

-9.83 -9.01 -9.00 -7.44 -4.55 -3.96 -3.93 3.91 -3.89 -3.85 -3.81

1.57e-05 4.21e-05 4.21e-05 5.59e-04 1.77e-01 5.27e-01 5.27e-01 5.27e-01 5.27e-01 5.29e-01 5.29e-01

8.192 7.307 7.292 5.314 0.563 -0.558 -0.621 -0.654 -0.683 -0.765 -0.837

Notice that the top gene is ApoAI itself which is heavily down-regulated. Theoretically the M-value should be minus in?nity for ApoAI because it is the knockout gene. Several of the other genes are closely related. The top eight genes here were con?rmed by independent assay subsequent to the microarray experiment to be di?erentially expressed in the knockout versus the control line.
> volcanoplot(fit,coef=2,highlight=8,names=fit$genes$NAME,main="KO vs Control")

11.3

Ecoli Lrp Data: A?ymetrix Data with Two Targets

The data are from experiments reported in [9] and are available from the www site http: //visitor.ics.uci.edu/genex/cybert/tutorial/index.html. The data is also available from the ecoliLeucine data package available from the Bioconductor www site under ”Experimental Data”. Hung et al [9] state that 67

The purpose of the work presented here is to identify the network of genes that are di?erentially regulated by the global E. coli regulatory protein, leucine-responsive regulatory protein (Lrp), during steady state growth in a glucose supplemented minimal salts medium. Lrp is a DNA-binding protein that has been reported to a?ect the expression of approximately 55 genes. Gene expression in two E. coli bacteria strains, labelled lrp+ and lrp-, were compared using eight A?ymetrix ecoli chips, four chips each for lrp+ and lrp-. The following code assumes that the data ?les for the eight chips are in your current working directory.
> dir() [1] "Ecoli.CDF" [4] "nolrp_3.CEL" [7] "wt_2.CEL" "nolrp_1.CEL" "nolrp_4.CEL" "wt_3.CEL" "nolrp_2.CEL" "wt_1.CEL" "wt_4.CEL"

The data is read and normalized using the a?y package. The package ecolicdf must also be installed, otherwise the rma() function will attempt to download and install it for you—without giving you to opportunity to veto the download.
> library(limma) > library(affy) Welcome to Bioconductor Vignettes contain introductory material. simply type: openVignette() For details on reading vignettes, see the openVignette help page. > Data <- ReadAffy() > eset <- rma(Data) Background correcting Normalizing Calculating Expression > pData(eset) sample nolrp_1.CEL 1 nolrp_2.CEL 2 nolrp_3.CEL 3 nolrp_4.CEL 4 wt_1.CEL 5 wt_2.CEL 6 wt_3.CEL 7 wt_4.CEL 8

To view,

Now we consider di?erential expression between the lrp+ and lrp- strains.
> > > > strain <- c("lrp-","lrp-","lrp-","lrp-","lrp+","lrp+","lrp+","lrp+") design <- model.matrix(~factor(strain)) colnames(design) <- c("lrp-","lrp+vs-") design lrp- lrp+vs1 1 0

68

2 1 0 3 1 0 4 1 0 5 1 1 6 1 1 7 1 1 8 1 1 attr(,"assign") [1] 0 1 attr(,"contrasts") attr(,"contrasts")$"factor(strain)" [1] "contr.treatment"

The ?rst coe?cient measures log2 -expression of each gene in the lrp- strain. The second coe?cient measures the log2 -fold change of lrp+ over lrp-, i.e., the log-fold change induced by lrp.
> > > > fit <- lmFit(eset, design) fit <- eBayes(fit) options(digits=2) topTable(fit, coef=2, n=40, adjust="BH") B 8.017 6.603 5.779 4.911 4.766 4.593 4.504 4.434 3.958 3.601 3.474 3.437 3.019 2.843 2.226 2.210 2.028 1.932 1.810 1.758 1.616 1.570 1.238 0.820 0.743 0.722 0.720 0.565 0.496

ProbeSetID M A t P.Value 4282 IG_821_1300838_1300922_fwd_st -3.32 12.4 -23.1 5.3e-05 5365 serA_b2913_st 2.78 12.2 15.8 6.0e-04 1389 gltD_b3213_st 3.03 10.9 13.3 1.6e-03 4625 lrp_b0889_st 2.30 9.3 11.4 4.0e-03 1388 gltB_b3212_st 3.24 10.1 11.1 4.0e-03 4609 livK_b3458_st 2.35 9.9 10.8 4.0e-03 4901 oppB_b1244_st -2.91 10.7 -10.6 4.0e-03 4903 oppD_b1246_st -1.94 10.4 -10.5 4.0e-03 5413 sodA_b3908_st 1.50 10.3 9.7 6.5e-03 4900 oppA_b1243_st -2.98 13.0 -9.1 9.2e-03 5217 rmf_b0953_st -2.71 13.6 -9.0 9.3e-03 7300 ytfK_b4217_st -2.64 11.1 -8.9 9.3e-03 5007 pntA_b1603_st 1.58 10.1 8.3 1.4e-02 4281 IG_820_1298469_1299205_fwd_st -2.45 10.7 -8.1 1.6e-02 4491 ilvI_b0077_st 0.95 10.0 7.4 2.9e-02 5448 stpA_b2669_st 1.79 10.0 7.4 2.9e-02 611 b2343_st -2.12 10.8 -7.1 3.4e-02 5930 ybfA_b0699_st -0.91 10.5 -7.0 3.5e-02 1435 grxB_b1064_st -0.91 9.8 -6.9 3.8e-02 4634 lysU_b4129_st -3.30 9.3 -6.9 3.9e-02 4829 ndk_b2518_st 1.07 11.1 6.7 4.3e-02 2309 IG_1643_2642304_2642452_rev_st 0.83 9.6 6.7 4.3e-02 4902 oppC_b1245_st -2.15 10.7 -6.3 5.9e-02 4490 ilvH_b0078_st 1.11 9.9 5.9 8.8e-02 1178 fimA_b4314_st 3.40 11.7 5.9 8.8e-02 6224 ydgR_b1634_st -2.35 9.8 -5.8 8.8e-02 4904 oppF_b1247_st -1.46 9.9 -5.8 8.8e-02 792 b3914_st -0.77 9.5 -5.7 1.0e-01 5008 pntB_b1602_st 1.47 12.8 5.6 1.0e-01

69

4610 5097 4886 4898 5482 1927 6320 196 954 1186 4013

livM_b3456_st ptsG_b1101_st nupC_b2393_st ompT_b0565_st tdh_b3616_st IG_13_14080_14167_fwd_st yeeF_b2014_st atpG_b3733_st cydB_b0734_st fimI_b4315_st IG_58_107475_107629_fwd_st

1.04 1.16 0.79 2.67 -1.61 -0.55 0.88 0.60 -0.76 1.15 -0.49

8.5 12.2 9.6 10.5 10.5 8.4 9.9 12.5 11.0 8.3 10.4

5.5 5.5 5.5 5.4 -5.3 -5.3 5.3 5.2 -5.0 5.0 -4.9

1.1e-01 1.1e-01 1.1e-01 1.2e-01 1.3e-01 1.3e-01 1.3e-01 1.4e-01 1.8e-01 1.8e-01 2.0e-01

0.376 0.352 0.333 0.218 0.092 0.076 0.065 -0.033 -0.272 -0.298 -0.407

The column M gives the log2 -fold change while the column A gives the average log2 -intensity for the probe-set. Positive M-values mean that the gene is up-regulated in lrp+, negative values mean that it is repressed. It is interesting to compare this table with Tables III and IV in [9]. Note that the topranked gene is an intergenic region (IG) tRNA gene. The knock-out gene itself is in position four. Many of the genes in the above table, including the ser, glt, liv, opp, lys, ilv and ?m families, are known targets of lrp.

11.4

Estrogen Data: A 2x2 Factorial Experiment with A?ymetrix Arrays

This data is from the estrogen package on Bioconductor. A subset of the data is also analysed in the factDesign package vignette. To repeat this case study you will need to have the R packages a?y, estrogen and hgu95av2cdf installed. The data gives results from a 2x2 factorial experiment on MCF7 breast cancer cells using A?ymetrix HGU95av2 arrays. The factors in this experiment were estrogen (present or absent) and length of exposure (10 or 48 hours). The aim of the study is the identify genes which respond to estrogen and to classify these into early and late responders. Genes which respond early are putative direct-target genes while those which respond late are probably downstream targets in the molecular pathway. First load the required packages:
> library(limma) > library(affy) Welcome to Bioconductor Vignettes contain introductory material. simply type: openVignette() For details on reading vignettes, see the openVignette help page. > library(hgu95av2cdf)

To view,

The data ?les are contained in the extdata directory of the estrogen package:
> datadir <- file.path(.find.package("estrogen"),"extdata") > dir(datadir)

70

[1] "00Index" "bad.cel" [6] "high48-2.cel" "low10-1.cel" [11] "phenoData.txt"

"high10-1.cel" "low10-2.cel"

"high10-2.cel" "low48-1.cel"

"high48-1.cel" "low48-2.cel"

The targets ?le is called phenoData.txt. We see there are two arrays for each experimental condition, giving a total of 8 arrays.
> targets <- readTargets("phenoData.txt",path=datadir,sep="",row.names="filename") > targets filename estrogen time.h low10-1 low10-1.cel absent 10 low10-2 low10-2.cel absent 10 high10-1 high10-1.cel present 10 high10-2 high10-2.cel present 10 low48-1 low48-1.cel absent 48 low48-2 low48-2.cel absent 48 high48-1 high48-1.cel present 48 high48-2 high48-2.cel present 48

Now read the cel ?les into an A?yBatch object and normalize using the rma() function from the a?y package:
> ab <- ReadAffy(filenames=targets$filename, celfile.path=datadir) > eset <- rma(ab) Background correcting Normalizing Calculating Expression

There are many ways to construct a design matrix for this experiment. Given that we are interested in the early and late estrogen responders, we can choose a parametrization which includes these two contrasts.
> > > > treatments <- factor(c(1,1,2,2,3,3,4,4),labels=c("e10","E10","e48","E48")) contrasts(treatments) <- cbind(Time=c(0,0,1,1),E10=c(0,1,0,0),E48=c(0,0,0,1)) design <- model.matrix(~treatments) colnames(design) <- c("Intercept","Time","E10","E48")

The second coe?cient picks up the e?ect of time in the absence of estrogen. The third and fourth coe?cients estimate the log2 -fold change for estrogen at 10 hours and 48 hours respectively.
> fit <- lmFit(eset,design)

We are only interested in the estrogen e?ects, so we choose a contrast matrix which picks these two coe?cients out:
> cont.matrix <- cbind(E10=c(0,0,1,0),E48=c(0,0,0,1)) > fit2 <- contrasts.fit(fit, cont.matrix) > fit2 <- eBayes(fit2)

71

We can examine which genes respond to estrogen at either time using the moderated Fstatistics on 2 degrees of freedom. The moderated F p-value is stored in the component fit2$F.p.value. What p-value cuto? should be used? One way to decide which changes are signi?cant for each gene would be to use Benjamini and Hochberg’s method to control the false discovery rate across all the genes and both tests:
> results <- decideTests(fit2, method="global")

Another method would be to adjust the F-test p-values rather than the t-test p-values:
> results <- decideTests(fit2, method="nestedF")

Here we use a more conservative method which depends far less on distributional assumptions, which is to make use of control and spike-in probe-sets which theoretically should not be di?erentially-expressed. The smallest p-value amongst these contols turns out to be about 0.00014:
> i <- grep("AFFX",geneNames(eset)) > summary(fit2$F.p.value[i]) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.0001391 0.1727000 0.3562000 0.4206000 0.6825000 0.9925000

So a cuto? p-value of 0.0001, say, would conservatively avoid selecting any of the control probe-sets as di?erentially expressed:
> results <- classifyTestsF(fit2, p.value=0.0001) > summary(results) E10 E48 -1 40 76 0 12469 12410 1 116 139 > table(E10=results[,1],E48=results[,2]) E48 E10 -1 0 1 -1 29 11 0 47 12370 1 0 29

0 52 87

> vennDiagram(results,include="up")

72

> vennDiagram(results,include="down")

We see that 87 genes were up regulated at both 10 and 48 hours, 29 only at 10 hours and 52 only at 48 hours. Also, 29 genes were down-regulated throughout, 11 only at 10 hours and 47 only at 48 hours. No genes were up at one time and down at the other. topTable gives a detailed look at individual genes. The leading genes are clearly signi?cant, even using the default p-value adjustment method, which is the highly conservative Holm’s method.
> options(digits=3) > topTable(fit2,coef="E10",n=20) ID M A t P.Value B 9735 39642_at 2.94 7.88 23.7 5.99e-05 9.97 12472 910_at 3.11 9.66 23.6 6.26e-05 9.94 1814 31798_at 2.80 12.12 16.4 1.29e-03 7.98

73

11509 41400_at 2.38 10.04 16.2 1.41e-03 7.92 10214 40117_at 2.56 9.68 15.7 1.86e-03 7.70 953 1854_at 2.51 8.53 15.2 2.46e-03 7.49 9848 39755_at 1.68 12.13 15.1 2.59e-03 7.45 922 1824_s_at 1.91 9.24 14.9 2.86e-03 7.37 140 1126_s_at 1.78 6.88 13.8 5.20e-03 6.89 580 1536_at 2.66 5.94 13.3 7.30e-03 6.61 12542 981_at 1.82 7.78 13.1 8.14e-03 6.52 3283 33252_at 1.74 8.00 12.6 1.12e-02 6.25 546 1505_at 2.40 8.76 12.5 1.20e-02 6.19 4405 34363_at -1.75 5.55 -12.2 1.44e-02 6.03 985 1884_s_at 2.80 9.03 12.1 1.59e-02 5.95 6194 36134_at 2.49 8.28 11.8 1.90e-02 5.79 7557 37485_at 1.61 6.67 11.4 2.50e-02 5.55 1244 239_at 1.57 11.25 10.4 5.14e-02 4.90 8195 38116_at 2.32 9.51 10.4 5.16e-02 4.90 10634 40533_at 1.26 8.47 10.4 5.31e-02 4.87 > topTable(fit2,coef="E48",n=20) ID M 12472 910_at 3.86 1814 31798_at 3.60 953 1854_at 3.34 8195 38116_at 3.76 8143 38065_at 2.99 9848 39755_at 1.77 642 1592_at 2.30 11509 41400_at 2.24 3766 33730_at -2.04 732 1651_at 2.97 8495 38414_at 2.02 1049 1943_at 2.19 10214 40117_at 2.28 10634 40533_at 1.64 9735 39642_at 1.61 4898 34851_at 1.96 922 1824_s_at 1.64 6053 35995_at 2.76 12455 893_at 1.54 10175 40079_at -2.41 A t P.Value B 9.66 29.2 1.04e-05 11.61 12.12 21.1 1.62e-04 9.89 8.53 20.2 2.29e-04 9.64 9.51 16.9 1.02e-03 8.48 9.10 16.2 1.42e-03 8.21 12.13 15.8 1.72e-03 8.05 8.31 15.8 1.76e-03 8.03 10.04 15.3 2.29e-03 7.81 8.57 -15.1 2.48e-03 7.74 10.50 14.8 3.02e-03 7.57 9.46 14.6 3.36e-03 7.48 7.60 14.0 4.69e-03 7.18 9.68 14.0 4.79e-03 7.16 8.47 13.5 6.24e-03 6.93 7.88 13.0 8.46e-03 6.65 9.96 12.8 9.47e-03 6.55 9.24 12.8 1.00e-02 6.50 8.87 12.7 1.05e-02 6.46 10.95 12.7 1.06e-02 6.45 8.23 -12.6 1.09e-02 6.42

11.5

Weaver Mutant Data: A 2x2 Factorial Experiment with Two-Color Data

This case study considers a more involved analysis in which the sources of RNA have a factorial structure. Background. This is a case study examining the development of certain neurons in wild-type and weaver mutant mice from [6]. The weaver mutant a?ects cerebellar granule neurons, the 74

most numerous cell-type in the central nervous system. Weaver mutant mice are characterized by a weaving gait. Granule cells are generated in the ?rst postnatal week in the external granule layer of the cerebellum. In normal mice, the terminally di?erentiated granule cells migrate to the internal granule layer but in mutant mice the cells die before doing so, meaning that the mutant mice have strongly reduced numbers of cells in the internal granule layer. The expression level of any gene which is speci?c to mature granule cells, or is expressed in response to granule cell derived signals, is greatly reduced in the mutant mice. Tissue dissection and RNA preparation. At each time point (P11 = 11 days postnatal and P21 = 21 days postnatal) cerebella were isolated from two wild-type and two mutant littermates and pooled for RNA isolation. RNA was then divided into aliquots and labelled before hybridizing to the arrays. (This means that di?erent hybridizations are biologically related through using RNA from the same mice, although we will ignore this here. See Yang and Speed (2002) for a detailed discussion of this issue in the context of this experiment.) Hybridizations. There are four di?erent treatment combinations, P11wt, P11mt, P21wt and P21mt, which might think of as a 2x2 factorial structure. We consider ten arrays in total. There are six arrays comparing the four di?erent RNA sources to a common reference, which was a pool of RNA from all the time points, and four arrays making direct comparisons between the four treatment combinations. First read in the data. We assume that the data is an directory called c:/Weaver. The data used for this case study can be downloaded from http://bioinf.wehi.edu.au/limma/ data/weaver.zip. We ?rst read in the targets frame, and then read the intensity data using ?le names recorded in the targets ?le. The data was produced using SPOT image analysis software and is stored in the subdirectory /spot. Notice that a spot quality weight function as been set. For these arrays the median spot area is just over 50 pixels. The spot quality function has been set so that any spot with an area less than 50 pixels will get reduced weight, so that a hypothetical spot of zero area would get zero weight.
> library(limma) > targets <- readTargets("targets.txt") > targets FileName Tissue Mouse Cy5 Cy3 cbmut.3 cbmut.3.spot Cerebellum Weaver P11wt Pool cbmut.4 cbmut.4.spot Cerebellum Weaver P11mt Pool cbmut.5 cbmut.5.spot Cerebellum Weaver P21mt Pool cbmut.6 cbmut.6.spot Cerebellum Weaver P21wt Pool cbmut.15 cbmut.15.spot Cerebellum Weaver P21wt Pool cbmut.16 cbmut.16.spot Cerebellum Weaver P21mt Pool cb.1 cb.1.spot Cerebellum Weaver P11wt P11mt cb.2 cb.2.spot Cerebellum Weaver P11mt P21mt cb.3 cb.3.spot Cerebellum Weaver P21mt P21wt cb.4 cb.4.spot Cerebellum Weaver P21wt P11wt > wtfun <- function(x) pmin(x$area/50, 1) > RG <- read.maimages(targets$FileName, source = "spot", path = "spot", wt.fun = wtfun)

75

Read Read Read Read Read Read Read Read Read Read

spot/cbmut.3.spot spot/cbmut.4.spot spot/cbmut.5.spot spot/cbmut.6.spot spot/cbmut.15.spot spot/cbmut.16.spot spot/cb.1.spot spot/cb.2.spot spot/cb.3.spot spot/cb.4.spot

The SPOT software does not store probe IDs in the output ?les, so we need to read in the ID and annotation information separately. We also read in a spottypes ?le and set a range of control spots.
> RG$genes <- read.delim("genelist.txt", header = TRUE, as.is = TRUE) > RG$printer <- list(ngrid.r = 8, ngrid.c = 4, nspot.r = 25, nspot.c = 24) > spottypes <- readSpotTypes("spottypes.txt")

Note for R 2.1.0 or 2.1.1. There is a bug in the read.table function of R 2.1 which prevents readSpotTypes() from reading in the backslashes in the Name column of the ?le spottypes.txt correctly. If you are reproducing this data example in R 2.1, you should insert the commands
> spottypes$Name <- gsub("\\(","\\\\(",spottypes$Name) > spottypes$Name <- gsub("\\)","\\\\)",spottypes$Name)

at this point as a work-around.
> spottypes SpotType ID Name col cex 1 Riken * * black 0.2 2 Custom Control * black 1.0 3 Buffer Control 3x SSC yellow 1.0 4 CerEstTitration Control cer est \\(* lightblue 1.0 5 LysTitration Control Lys \\(* orange 1.0 6 PheTitration Control Phe \\(* orange 1.0 7 RikenTitration Control Riken est \\(* blue 1.0 8 ThrTitration Control Thr \\(* orange 1.0 9 18S Control 18S \\(0.15ug/ul\\) pink 1.0 10 GAPDH Control GAPDH \\(0.15 ug/ul\\) red 1.0 11 Lysine Control Lysine \\(0.2 ug/ul\\) magenta 1.0 12 Threonine Control Threonine \\(0.2ug/ul\\) lightgreen 1.0 13 Tubulin Control Tubulin \\(0.15 ug/ul\\) green 1.0 > RG$genes$Status <- controlStatus(spottypes, RG) Matching patterns for: ID Name Found 19200 Riken Found 2304 Custom

76

Found 710 Buffer Found 192 CerEstTitration Found 224 LysTitration Found 260 PheTitration Found 160 RikenTitration Found 224 ThrTitration Found 64 18S Found 64 GAPDH Found 32 Lysine Found 32 Threonine Found 64 Tubulin Setting attributes: values col cex > plotMA(RG,array=9,xlim=c(4,15.5))

Here Bu?er is an obvious negative control while 18S, GAPDH, Lysine, Threonine and Tubulin are single-gene positive controls, sometime called house-keeping genes. RikenTitration is a titration series of a pool of the entire Riken library, and can be reasonably expected to be non-di?erentially expressed. CerEstTitration is a titration of a pool of a cerebellum EST library. This will show higher expression in later mutant tissues. The Lys, Phe and Thr series are single-gene titration series which were not spike-in in this case and can be treated as negative controls. Now normalize the data. Because the Riken titration library, being based on a pool of a large number of non-speci?c genes, should not be di?erentially expressed, we up-weight these spots in the print-tip normalization step:
> w <- modifyWeights(RG$weights, RG$genes$Status, "RikenTitration", 2) > MA <- normalizeWithinArrays(RG, weights = w)

Now ?t a linear model to the data. Because of the composite design, with some common reference arrays and some direct comparison arrays, the simplest method is to use a groupmean parametrization with all RNA samples compared back to the Pool. 77

> design <- modelMatrix(targets, ref = "Pool") Found unique target names: P11mt P11wt P21mt P21wt Pool > design P11mt P11wt P21mt P21wt 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0 -1 1 0 0 1 0 -1 0 0 0 1 -1 0 -1 0 1

cbmut.3 cbmut.4 cbmut.5 cbmut.6 cbmut.15 cbmut.16 cb.1 cb.2 cb.3 cb.4

All the control spots are removed before ?tting the linear model:
> isGene <- MA$genes$Status == "Riken" > fit <- lmFit(MA[isGene, ], design)

We now extract all possible comparisons of interest as contrasts. We look for the mutant vs wt comparisons at 11 and 21 days, the time e?ects for mutant and wt, and the interaction terms:
> cont.matrix <- makeContrasts( + WT11.MT11=P11mt-P11wt, + WT21.MT21=P21mt-P21wt, + WT11.WT21=P21wt-P11wt, + MT11.MT21=P21mt-P11mt, + Int=(P21mt-P11mt)-(P21wt-P11wt), + levels=design) > fit2 <- contrasts.fit(fit, cont.matrix) > fit2 <- eBayes(fit2)

Adjustment for multiple testing, with Benjamini and Hochberg’s method applied to the F-test p-values across genes with 5% false discovery rate, and nesting F-testing used within contrasts, leads to the following:
> results <- decideTests(fit2, method = "nestedF") > summary(results) WT11.MT11 WT21.MT21 WT11.WT21 MT11.MT21 Int 3 29 132 136 14 16884 16788 16653 16571 16846 9 79 111 189 36

-1 0 1

78

There are 187 genes up and 136 genes down in mutant at 21 days vs 11 days. There are 36 genes which respond more up in the mutant than the wt, and 14 genes which respond more down in the mutant than the wt. A heatdiagram shows that the genes are mostly responding in the same direction in the mutant and wt, but to di?erent degrees:
> heatDiagram(results, fit2$coef, primary = "Int")

Int MT11.MT21 WT11.WT21 WT21.MT21 WT11.MT11 13718 10916 14395 495 12532 10300 8030 15001 14288 4741 13236 7797 13520 8993 14250 3968 2129 2630 11521 782 9081 4337 10328 3427 829 2046 12131 14225 11441 4191 3590 15237 12034 13968 13080 11201 5469 1435 16472 10150 3920 5349 7404 15535 1184 11741 3258 14898 14399 13843

11.6

Bob Mutant Data: Within-Array Replicate Spots

In this section we consider a case study in which all genes (ESTs and controls) are printed more than once on the array. This means that there is both within-array and between-array replication for each gene. The structure of the experiment is therefore essentially a randomized block experiment for each gene. The approach taken here is to estimate a common correlation for all the genes for between within-array duplicates. The theory behind the approach is explained in [22]. This approach assumes that all genes are replicated the same number of times on the array and that the spacing between the replicates is entirely regular. In this example we assume that the data is available as an RG list. Background. This data is from a study of transcription factors critical to B cell maturation by Lynn Corcoran and Wendy Dietrich at the WEHI. Mice which have a targeted mutation in the Bob (OBF-1) transcription factor display a number of abnormalities in the B lymphocyte compartment of the immune system. Immature B cells that have emigrated from the bone marrow fail to di?erentiate into full ?edged B cells, resulting in a notable de?cit of mature B cells. Arrays. Arrays were printed at the Australian Genome Research Facility with expressed sequence tags (ESTs) from the National Institute of Aging 15k mouse clone library, plus a range of positive, negative and calibration controls. The arrays were printed using a 48 tip print head and 26x26 spots in each tip group. Data from 24 of the tip groups are given here. Every gene (ESTs and controls) was printed twice on each array, side by side by rows. The NIA15k probe IDs have been anonymized in the output presented here. Hybridizations. A retrovirus was used to add Bob back to a Bob de?cient cell line. Two RNA sources were compared using 2 dye-swap pairs of microarrays. One RNA source was obtained from the Bob de?cient cell line after the retrovirus was used to add GFP (”green 79

?uorescent protein”, a neutral protein). The other RNA source was obtained after adding both GFP and Bob protein. RNA from Bob+GFP was labelled with Cy5 in arrays 2 and 4, and with Cy3 in arrays 1 and 4. Image analysis. The arrays were image analyzed using SPOT with “morph” background estimation. The data used for this case study can be downloaded from http://bioinf.wehi.edu.au/ limma/data/Bob.RData. The ?le should be placed in the working directory of your R session. (This case study was last updated on 29 June 2006 using R 2.3.0 and limma 2.7.5.)
> library(limma) > load("Bob.RData") > objects() [1] "design" "RG" > design [1] -1 1 -1 1 > names(RG) [1] "R" "G" "Rb" > RG$genes[1:40,] Library ID 1 Control cDNA1.500 2 Control cDNA1.500 3 Control Printing.buffer 4 Control Printing.buffer 5 Control Printing.buffer 6 Control Printing.buffer 7 Control Printing.buffer 8 Control Printing.buffer 9 Control cDNA1.500 10 Control cDNA1.500 11 Control Printing.buffer 12 Control Printing.buffer 13 Control Printing.buffer 14 Control Printing.buffer 15 Control Printing.buffer 16 Control Printing.buffer 17 Control cDNA1.500 18 Control cDNA1.500 19 Control Printing.buffer 20 Control Printing.buffer 21 Control Printing.buffer 22 Control Printing.buffer 23 Control Printing.buffer 24 Control Printing.buffer 25 Control cDNA1.500 26 Control cDNA1.500 27 NIA15k H31 28 NIA15k H31 29 NIA15k H32 30 NIA15k H32 31 NIA15k H33 32 NIA15k H33

"Gb"

"genes"

"printer"

80

33 34 35 36 37 38 39 40

NIA15k NIA15k NIA15k NIA15k NIA15k NIA15k NIA15k NIA15k

H34 H34 H35 H35 H36 H36 H37 H37

Although there are only four arrays, we have a total of eight spots for each gene, and more for the controls. Naturally the two M-values obtained from duplicate spots on the same array are highly correlated. The problem is how to make use of the duplicate spots in the best way. The approach taken here is to estimate the spatial correlation between the adjacent spots using REML and then to conduct the usual analysis of the arrays using generalized least squares. First normalize the data using print-tip loess regression. The SPOT morph background ensures that the default background subtraction can be used without inducing negative intensities.
> MA <- normalizeWithinArrays(RG)

Then remove the control probes:
> MA2 <- MA[MA$genes$Library=="NIA15k", ]

Now estimate the spatial correlation. We estimate a correlation term by REML for each gene, and then take a trimmed mean on the atanh scale to estimate the overall correlation. This command will probably take at least a few minutes depending on the speed of your computer.
> corfit <- duplicateCorrelation(MA2,design,ndups=2) # A slow computation! Loading required package: statmod Attaching package: ’statmod’

The following object(s) are masked from package:limma : matvec vecmat > corfit$consensus.correlation [1] 0.5749556 > boxplot(tanh(corfit$atanh.correlations))

81

fit <- lmFit(MA2,design,ndups=2,correlation=corfit$consensus) fit <- eBayes(fit) options(digits=3) topTable(fit,n=30,adjust="BH") Library ID M A t P.Value adj.P.Val B 4599 NIA15k H34599 0.404 10.93 12.92 1.34e-07 0.000443 8.02 1324 NIA15k H31324 -0.520 7.73 -12.24 2.23e-07 0.000443 7.57 3309 NIA15k H33309 0.420 10.99 11.99 2.71e-07 0.000443 7.39 440 NIA15k H3440 0.568 9.96 11.64 3.60e-07 0.000443 7.13 6795 NIA15k H36795 0.460 10.78 11.56 3.85e-07 0.000443 7.07 121 NIA15k H3121 0.441 10.48 11.31 4.73e-07 0.000443 6.88 2838 NIA15k H32838 1.640 12.74 11.26 4.92e-07 0.000443 6.84 6999 NIA15k H36999 0.381 9.91 11.19 5.21e-07 0.000443 6.79 132 NIA15k H3132 0.370 10.10 11.17 5.31e-07 0.000443 6.77 6207 NIA15k H36207 -0.393 8.53 -11.06 5.82e-07 0.000443 6.69 7168 NIA15k H37168 0.391 9.88 10.76 7.51e-07 0.000493 6.45 1831 NIA15k H31831 -0.374 9.62 -10.63 8.41e-07 0.000493 6.35 2014 NIA15k H32014 0.363 9.65 10.49 9.54e-07 0.000493 6.23 7558 NIA15k H37558 0.532 11.42 10.49 9.58e-07 0.000493 6.22 4471 NIA15k H34471 -0.353 8.76 -10.41 1.02e-06 0.000493 6.16 126 NIA15k H3126 0.385 10.59 10.40 1.03e-06 0.000493 6.15 4360 NIA15k H34360 -0.341 9.37 -10.22 1.21e-06 0.000545 6.00 6794 NIA15k H36794 0.472 11.33 10.11 1.35e-06 0.000570 5.90 329 NIA15k H3329 0.413 11.37 9.97 1.53e-06 0.000612 5.78 5017 NIA15k H35017 0.434 11.41 9.90 1.63e-06 0.000618 5.72 2678 NIA15k H32678 0.461 10.44 9.74 1.90e-06 0.000618 5.57 2367 NIA15k H32367 0.409 10.21 9.72 1.93e-06 0.000618 5.56 1232 NIA15k H31232 -0.372 8.72 -9.70 1.96e-06 0.000618 5.54 111 NIA15k H3111 0.369 10.42 9.69 1.98e-06 0.000618 5.53 2159 NIA15k H32159 0.418 10.19 9.67 2.03e-06 0.000618 5.51 4258 NIA15k H34258 0.299 9.11 9.62 2.12e-06 0.000622 5.47 3192 NIA15k H33192 -0.410 9.46 -9.55 2.26e-06 0.000638 5.41 6025 NIA15k H36025 0.427 10.37 9.47 2.45e-06 0.000654 5.33 5961 NIA15k H35961 -0.362 8.50 -9.46 2.49e-06 0.000654 5.31 1404 NIA15k H31404 0.474 11.34 9.26 3.00e-06 0.000722 5.13 > volcanoplot(fit)

> > > >

82

Notes
Acknowledgements
Thanks to Yee Hwa Yang and Sandrine Dudoit for the ?rst three data sets. The Swirl zebra?sh data were provided by Katrin Wuennenburg-Stapleton from the Ngai Lab at UC Berkeley. Laurent Gautier made the ecoliLeucine data set available on Bioconductor. Lynn Corcoran provided the Bob Mutant data. The limma package has bene?ted from many people who have made suggestions or reported bugs including Naomi Altman, Henrik Bengtsson, Lourdes Pe?a Castillo, Marcus Davy, Ran mon Diaz-Uriarte, Robert Gentleman, Wolfgang Huber, William Kenworthy, Kevin Koh, Erik Kristiansson, Mette Langaas, Gregory Lefebvre, Andrew Lynch, James MacDonald, Martin Maechler, Ron Ophir, Francois Pepin, Hubert Rehrauer, Matthew Ritchie, Ken Simpson, Laurentiu Adi Tarca, Bj¨rn Usadel, James Wettenhall, Chris Wilkinson, Yee Hwa (Jean) Yang, o John Zhang.

Conventions
Where possible, limma tries to use the convention that class names are in upper CamelCase, i.e., the ?rst letter of each word is capitalized, while function names are in lower camelCase, i.e., ?rst word is lowercase. When periods appear in function names, the ?rst word should be an action while the second word is the name of a type of object on which the function acts.

Software Projects Using limma
The limma package is used as a building block or as the underlying computational engine by a number of software projects designed to provide user-interfaces for microarray data analysis including [7, 27, 28, 3], the KTH Package [20], SKCC WebArray [31], and CARMAweb [11]. The LCBBASE project provides a limma plug-in for the BASE database [14]. The Stanford Microarray Database http://genome-www5.stanford.edu calls out to limma for background correction options.

83

Citations
Biological studies using the limma package include [8, 19, 18, 2, 16, 26]. Methodological studies using the limma package include [12, 13].

84

Bibliography
[1] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Statist. Soc. B, 57:289–300, 1995. [2] P. C. Boutros, I. D. Mo?at, M. A. Franc, N. Tijet, J. Tuomisto, R. Pohjanvirta, and A. B. Okey. Identi?cation of the DRE-II gene battery by phylogenetic footprinting. Biochem Biophys Res Commun, 321(3):707–715, 2004. [3] Andreas Buness, Wolfgang Huber, Klaus Steiner, Holger S¨ltmann, and Annemarie u Poustka. arrayMagic: two-colour cDNA microarray quality control and preprocessing. Bioinformatics, 21(4):554–556, 2005. [4] M. J. Callow, S. Dudoit, E. L. Gong, T. P. Speed, and E. M. Rubin. Microarray expression pro?ling identi?es genes with altered expression in HDL de?cient mice. Genome Research, 10:2022–2029, 2000. [5] P. Dalgaard. Introductory Statistics with R. Springer, New York, 2002. [6] E. Diaz, Y. Ge, Y. H. Yang, K. C. Loh, T. A. Sera?ni, Y. Okazaki, Y. Hayashizaki, T. Speed, J. P., Ngai, and P. Schei?ele. Molecular analysis of gene expression in the developing pontocerebellar projection system. Neuron, 36:417–434, 2002. [7] Ste?en Durinck, Joke Allemeersch, Vincent J. Carey, Yves Moreau, and Bart De Moor. Importing MAGE-ML format microarray data into BioConductor. Bioinformatics, 20(18):3641–3642, 2004. [8] R. Golden, T. and S. Melov. Microarray analysis of gene expression with age in individual nematodes. Aging Cell, 3:111–124, 2004. [9] S. Hung, P. Baldi, and G. W. Hat?eld. Global gene expression pro?ling in Escherichia coli K12: The e?ects of leucine-responsive regulatory protein. Journal of Biological Chemistry, 277(43):40309–40323, 2002. [10] R. Irizarry. From CEL ?les to annotated lists of interesting genes. In R. Gentleman, V. Carey, S Dudoit, R Irizarry, and W. Huber, editors, Bioinformatics and Computational Biology Solutions using R and Bioconductor, pages 431–442. Springer, New York, 2005.

85

[11] Rainer J, Sanchez-Cabo F, Stocker G, Sturn A, and Trajanoski Z. CARMAweb: comprehensive r- and bioconductor-based web service for microarray data analysis. Nucleic Acids Res, 34(Web Server issue):W498–503, 2006. [12] C. Kendziorski, R. A. Irizarry, K.-S. Chen, J. D. Haag, and M. N. Gould. On the utility of pooling biological samples in microarray experiments. PNAS, 102(12):4252–4257, 2005. [13] Charles Kooperberg, Aaron Aragaki, Andrew D. Strand, and James M. Olson. Significance testing for small microarray experiments. Statistics in Medicine, 24:2281–2298, 2005. [14] Linnaeus Centre for Bioinformatics, Uppsala University, Sweden. BASE plug-ins. Software package, http://www.lcb.uu.se/baseplugins.php, 2005. [15] G. A. Milliken and D. E. Johnson. Analysis of Messy Data, Volume 1: Designed Experiments. Chapman & Hall, New York, 1992. [16] M. J. Peart, G. K. Smyth, R. K. van Laar, V. M. Richon, A. J. Holloway, and R. W. Johnstone. Identi?cation and functional signi?cance of genes regulated by structurally diverse histone deacetylase inhibitors. Proceedings of the National Academy of Sciences of the United States of America, 102(10):3697–3702, 2005. [17] A. Reiner, D. Yekutieli, and Y. Benjamini. Identifying di?erentially expressed genes using false discovery rate controlling procedures. Bioinformatics, 19:368–375, 2003. [18] S. C. P. Renn, N. Aubin-Horth, and H. A. Hofmann. Biologically meaningful expression pro?ling across species using heterologous hybridization to a cdna microarray. BMC Genomics, 5(42), 2004. [19] M. W. Rodriguez, A. C. Paquet, Y. H. Yang, and D. J. Erle. Di?erential gene expression by integrin β7+ and β7- memory T helper cells. BMC Immunology, 5(13), 2004. [20] Royal Institute of Technology, Sweden. KTH-package for microarray data analysis. Software package, http://www.biotech.kth.se/molbio/microarray/pages/ kthpackagetransfer.html, 2005. [21] G. K. Smyth. Paper 116: Individual channel analysis of two-colour microarrays. In 55th Session of the International Statistics Institute, 5-12 April 2005, Sydney Convention & Exhibition Centre, Sydney, Australia (CD). International Statistical Institute, Bruxelles, 2005. [22] G. K. Smyth, J. Michaud, and H. Scott. The use of within-array replicate spots for assessing di?erential expression in microarray experiments. Bioinformatics, 21(9):2067– 2075, 2005. [23] G. K. Smyth and T. P. Speed. Normalization of cDNA microarray data. Methods, 31(4):265–273, 2003. 86

[24] G. K. Smyth, Y. H. Yang, and T. Speed. Statistical issues in cDNA microarray data analysis. Methods in Molecular Biology, 224:111–136, 2003. [25] G.K. Smyth. Linear models and empirical bayes methods for assessing di?erential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology, 3:Article 3, 2004. [26] Srinivasa Rao Uppalapati, Patricia Ayoubi, Hua Weng, David A. Palmer, Robin E. Mitchell, William Jones, and Carol L. Bender. The phytotoxin coronatine and methyl jasmonate impact multiple phytohormone pathways in tomato. The Plant Journal, 42(2):201–217, April 2005. [27] Juan M. Vaquerizas, Joaqu? Dopazo, and Ram?n D? ?n o ?az-Uriarte. DNMAD: web-based diagnosis and normalization for microarray data. Bioinformatics, 20(18):3656–3658, 2004. [28] Hua Weng and Patricia Ayoubi. GPAP (GenePix Pro Auto-Processor) for online preprocessing, normalization and statistical analysis of primary microarray data. Software package, Microarray Core Facility, Oklahoma State University, http://darwin. biochem.okstate.edu/gpap3, 2004. [29] J. M. Wettenhall and G. K. Smyth. limmaGUI: a graphical user interface for linear modeling of microarray data. Bioinformatics, 20:3705–3706, 2004. [30] R. D. Wol?nger, G. Gibson, E. D. Wol?nger, L. Bennett, H. Hamadeh, P. Bushel, C. Afshari, and R. S. Paules. Assessing gene signi?cance from cDNA microarray expression data via mixed models. Journal of Computational Biology, 8:625–637, 2001. [31] Xiaoqin Xia, Michael McClelland, and Yipeng Wang. Webarray: an online platform for microarray data analysis. BMC Bioinformatics, 6:306, 2005. [32] Y. H. Yang, S. Dudoit, P. Luu, D. M. Lin, V. Peng, J. Ngai, and T. P. Speed. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Research, 30(4):e15, 2002. [33] Y. H. Yang, S. Dudoit, P. Luu, and T. P. Speed. Normalization for cDNA microarray data. In M. L. Bittner, Y. Chen, A. N. Dorsel, and E. R. Dougherty, editors, Microarrays: Optical Technologies and Informatics, pages 141–152. Proceedings of SPIE, Volume 4266, 2001. [34] Y. H. Yang and T. P. Speed. Design and analysis of comparative microarray experiments. In T. P. Speed, editor, Statistical Analysis of Gene Expression Microarray Data, pages 35–91. Chapman & Hall/CRC Press, 2003. [35] Y. H. Yang and N. P. Thorne. Normalization for two-color cDNA microarray data. In D. R. Goldstein, editor, Science and Statistics: A Festschrift for Terry Speed, pages 403– 418. Institute of Mathematical Statistics Lecture Notes – Monograph Series, Volume 40, 2003. 87


赞助商链接

更多相关文章:
更多相关标签:

All rights reserved Powered by 甜梦文库 9512.net

copyright ©right 2010-2021。
甜梦文库内容来自网络,如有侵犯请联系客服。zhit325@126.com|网站地图