

A Comprehensive Microarray Data Generator to Map the Space of Classification and Clustering Methods

Gregory Piatetsky-Shapiro and Georges Grinstein

June 2004

We discuss our effort to date and our plan to create a generator of synthetic but realistic microarray datasets, use it to build a comprehensive test suite for studying the performance of different methods for gene selection, classification, and clustering, and find out which methods perform best under which conditions.

1. Introduction
It is well known that no specific machine learning algorithm is consistently superior for classification problems [Sc94, Wo96]. Past studies have established that different algorithms achieve superior results on different datasets. It is commonly accepted that methods like neural networks are better under noisy conditions when the class depends on many features, while decision trees are better for structured, noise-free data. Other methods with good performance include Support Vector Machines, Naïve Bayes, Bayesian networks, and nearest neighbor.

In many business and scientific data mining applications, there are a large number of records (many thousands or millions) and a much smaller number of attributes (hundreds). We also usually know the correct class, e.g. "did the customer buy the product" or "is this image the digit 5 or the letter S." Under those conditions, we can reliably evaluate different algorithms and select the one with the best performance.

However, in a number of important fields, especially the biological and life sciences, the data situation is quite different. A dramatic example is the analysis of DNA microarrays [Ko02]. Microarrays are a ground-breaking technology which holds the promise of revolutionizing the life sciences. Applications already reported include the very accurate diagnosis of some types of cancer, the prediction of treatment outcome based on genetic signatures, and the derivation of genetic regulatory pathways. Potential future applications include faster identification of new drug targets, personalized and improved drug therapy based on individual genetic signatures, and ultimately helping speed the process of finding cures for many types of cancer and other diseases. For these life-saving applications it is critically important to identify the best possible results and most accurate algorithms.
1.1 The Problem

One of the key problems in microarray data analysis is that the number of features (genes) in microarray data is typically very large (many thousands), while the number of records (samples) is typically very small, usually less than 100. This situation is likely to remain so for many interesting cases, including human data, due to the difficulty and expense of collecting and analyzing samples. Analyzing such "ultra-wide" and "short" datasets is statistically likely to generate false positives, i.e. genetic markers that are due only to chance. In addition, the underlying biological networks are usually unknown or only partially known, and the data has an unknown amount of measurement noise and even occasional mislabeled samples.

Many algorithms have been used (e.g., SOM, KNN, SVM, BN, NN), but it is hard to judge which algorithm is actually better. For example, at a recent CAMDA conference [CA03], researchers analyzed the same lung cancer datasets but came up with different lists of significant genes. How can we determine which lists are closer to the truth before undertaking expensive and lengthy clinical studies? Moreover, microarray datasets differ in their underlying technology, number of genes, samples, classes, level of noise, type of problem studied, and other parameters. Can we map the space of possible datasets and establish which algorithms are better for which types of data? Knowing the truth is difficult enough for classification, but even more so for clustering. All of these issues make it hard to determine what is the best classification or clustering for a given dataset and which algorithms are most appropriate for which types of microarray data.

1.2 The Proposed Solution

To deal with these issues, we propose to create a generator of synthetic but realistic microarray datasets and use it to build a comprehensive test suite of parameterized microarray datasets for testing classification and clustering algorithms.

While many dataset generators assume a normal distribution, this cannot always be assumed for microarray data. There are two key reasons. First, the underlying biological distribution is unknown, or is a mixture of distributions for several classes. Second, the data goes through many transformations, such as normalization and thresholding, all of which affect the distribution. Our initial tests do show, however, that the expression distributions of genes over-expressed for a particular class can frequently be approximated with a normal distribution.

Our initial focus will be on classification and clustering tasks and on Affymetrix-type data. We plan to later extend our data generator to other microarray platforms, such as cDNA, as well as to other tasks, such as time-series analysis. The process for designing the test suite for the other microarray platforms will be similar to the one described below.
We plan to develop the code in R and release it via the BioConductor [Bio] project.

1.3 The Contribution

This project is the first, to our knowledge, to offer an extensive and rigorous approach to evaluating algorithms for DNA microarray classification and clustering. This requires a comprehensive test suite where the true classes and correct answers are known, so that algorithms can be measured on their performance under varying parameters, such as the number of genes, the number of samples, the number of classes, data and class measurement error, etc. Furthermore, reliable diagnostics cannot be built simply by determining the most likely class for a given sample. We must also determine the probability associated with that and other diagnoses in order to make cost-sensitive decisions. This is precisely what is difficult without a large test set with known answers. We believe that the comprehensive microarray test suite and microarray dataset generator will have a significant impact on molecular-based diagnostics, and the need for them is urgent.


We envision widespread use of the test suite and generator, starting with generating collections of data sets in which one or more parameters are varied to determine the relative sensitivity of various algorithms to these parameters. For a given algorithm, we would be able to determine the probability range within which the algorithm makes a correct prediction. For a desired level of confidence, we would be able to determine the minimum number of samples needed to achieve this confidence under a variety of conditions. Many more uses can be imagined. We expect that the microarray data generator will help determine the most appropriate microarray classification and clustering algorithms and their confidence levels, and facilitate the use of microarrays in diagnostics, in finding the best treatment, and ultimately in curing many diseases. It is also clear that this approach will have wider applicability to domains where samples are limited and ground truth is unknown.

2. Background and Previous Research
Cells in the same organism normally contain the same genes, but these genes can be expressed differently, i.e. manufacture different amounts of messenger RNA (mRNA), which in turn directs the manufacture of different proteins, allowing the creation of a huge variety of cells. Virtually all differences in cell state or type are correlated with changes in the mRNA levels of many genes.
Figure 1. An example raw microarray image for a single sample (image courtesy of Affymetrix). The brightness of the dots represents the intensity of the expressed genes. The image on the left is translated by microarray imaging software into a table of numbers.

Gene              Value
D26528_at           193
D26561_cds1_at      -70
D26561_cds2_at      144
D26561_cds3_at       33
D26579_at           318
D26598_at          1764
D26599_at          1537
DNA microarrays are a ground-breaking technology (Figure 1) for measuring gene expression. As of 2003, microarrays such as the Affymetrix U133 2-chip set can measure the expression of over 30,000 genes, a majority of the expressed human genome. The reported applications of microarrays already include:

- Determination of genetic regulatory networks and other biological pathways [Ha02, ID01, YC02]
- Improving molecular disease diagnosis using gene expression levels [Ca00, Ca03, Du00, Du02, NR02, Go99, Ra01, and many more]
- Predicting treatment outcome after drug therapy based on gene expression profiles [Po02, Ka02]

The main types of microarray data analysis that support the above applications are gene selection: finding the genes most correlated with a set of symptoms; classification: finding classifiers for disease classes and/or outcomes when class labels are given; and clustering: finding clusters in the data when classes are not known or not given.


2.1 Importance of High Accuracy for Medical Diagnostics

For medical applications, such as diagnostics or helping to choose the right treatment, it is extremely important to get the answer with as high accuracy as possible. Consider a population of size N and a disease that appears in d% of the population. If we have a test with a false positive rate of fpos% and a false negative rate of fneg%, then the probability that a person chosen at random from that population has the disease, given that the test is positive, is

  p(disease | test = yes) = d * (1 - fneg) / (d * (1 - fneg) + (1 - d) * fpos)

For example, with a disease rate of 1%, fpos = 5%, and fneg = 10%, the probability that a person has the disease given a positive test result is only 0.15. Tests like that are ineffective and make thousands of patients unnecessarily worried. To reach a probability of 2/3 that a person has the disease given a positive test outcome, we need to reduce the fpos and fneg rates down to 0.5% (0.005). However, obtaining such high accuracy in microarray data analysis is hampered by a number of challenges, which we describe in the next sections.

2.2 Too Many Genes and Too Few Samples

One major problem is that the typical number of columns (genes) is quite large (many thousands), while the typical number of records (samples) is much smaller (usually less than 100). Having so many columns relative to the number of records is likely to lead to many false discoveries, such as gene markers that appear to be correlated with symptoms but are only due to chance, and models that appear to have good classification accuracy but are overfitting the data. This is especially true for non-linear learning algorithms, such as decision trees or neural networks, that look for non-linear combinations of features and have a much larger model space in which they can find spurious models.
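The predictive-value arithmetic above is easy to check with a short script (a minimal sketch; the function and variable names are ours, not from the proposal):

```python
def p_disease_given_positive(d, fpos, fneg):
    """P(disease | test positive) by Bayes' rule:
    d*(1-fneg) / (d*(1-fneg) + (1-d)*fpos)."""
    true_pos = d * (1 - fneg)    # diseased and correctly flagged
    false_pos = (1 - d) * fpos   # healthy but flagged anyway
    return true_pos / (true_pos + false_pos)

# Disease rate 1%, 5% false positives, 10% false negatives -> only ~0.15
print(round(p_disease_given_positive(0.01, 0.05, 0.10), 2))
# Driving both error rates down to 0.5% lifts it to ~2/3
print(round(p_disease_given_positive(0.01, 0.005, 0.005), 2))
```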
Because of the many difficulties in collecting and processing microarray data, this problem is likely to remain in many important cases, especially the analysis of human diseases, and we must develop methods to assess the accuracy of gene selection and classification under such conditions. One common approach for reducing false positives is to first use a linear method like the T-test to select a smaller number (e.g. 100-200) of the most promising genes, and then build models using only these genes. For the 2-class case, we can rank genes by comparing the mean expression values of each gene in the two classes, computing measures like the T-value or S2N (signal-to-noise ratio):


  T = (Mean1 - Mean2) / sqrt(SD1^2 / N1 + SD2^2 / N2)

  S2N = (Mean1 - Mean2) / (SD1 + SD2)
Here Mean1 and Mean2 are the gene expression means (across all samples) for each class, SD1 and SD2 the standard deviations, and N1 and N2 the sample counts for each class. For the multi-class case, we can compute these measures for each class vs. the rest. The models produced in this way are more accurate than models which work with a larger number of genes.

2.3 Instability of Significant Genes

However, the top genes selected by different methods and different researchers, even on the same data sets, show very significant variability. Figure 2 shows a visualization of the significant genes found in analyses of the well-known ALL/AML data [Go99, Ca00] by six different groups.
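The two ranking statistics can be sketched as follows (a dependency-free illustration with made-up expression values; the helper names are ours):

```python
import math

def t_value(x1, x2):
    """Unpooled two-sample T statistic:
    (Mean1 - Mean2) / sqrt(SD1^2/N1 + SD2^2/N2)."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    v1 = sum((v - m1) ** 2 for v in x1) / (n1 - 1)
    v2 = sum((v - m2) ** 2 for v in x2) / (n2 - 1)
    return (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)

def s2n(x1, x2):
    """Signal-to-noise ratio: (Mean1 - Mean2) / (SD1 + SD2)."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    sd1 = math.sqrt(sum((v - m1) ** 2 for v in x1) / (n1 - 1))
    sd2 = math.sqrt(sum((v - m2) ** 2 for v in x2) / (n2 - 1))
    return (m1 - m2) / (sd1 + sd2)

# Toy expression values for one gene in two classes
class1 = [320, 290, 350, 310]
class2 = [120, 150, 100, 130]
print(t_value(class1, class2), s2n(class1, class2))
```

Genes would then be ranked by the absolute value of either statistic, computed per gene across all samples.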

While Zyxin is present in almost all analyses, the rest of the genes are quite different. Similar disagreement was observed in the CAMDA 2003 competition [Ca03], where many researchers analyzed the same data and came up with different sets of significant genes. Even using the same method on different data subsets produces instability. We analyzed a 92-sample, 5-class pediatric brain tumor data set [Br03]. We used 3-fold cross-validation, generated the top 50 genes ranked by T-value for each fold, and examined the commonality of the top genes across folds. In 2 of the 5 classes, 21 and 23 out of 50 genes were common among all 3 folds, while for the other 3 classes only 2 to 4 genes out of 50 were common among all 3 folds. This strongly suggests that the majority of the top 50 genes are correlated with the target class only due to chance.

Figure 2. A parallel coordinate representation of the cluster relationships between 6 of the published CAMDA 2000 groups’ gene predictor sets and AnVil Informatics’ gene predictor set (from Gr01). This illustrates a high level of disagreement between gene sets.

We can assess the likelihood of such chance correlations by randomly permuting the class column a large number of times and comparing the correlations obtained under randomization with those obtained with the actual class column. The randomization approach is further refined in the SAM method developed by Tusher and Tibshirani [Tu01], which estimates the false discovery rate. However, randomization-based approaches still suffer from the significant problem of not knowing the ground truth.

2.4 Lack of Absolute Ground Truth

In most cases, the underlying biological networks and genes are at best partially known, and it is hard to determine, without expensive and time-consuming biological tests, which genes are actually the correct ones. Even for classification, when the labels are assigned by human experts, we may not have the actual truth. One problem is that many disease types are poorly differentiated and even experts are not always able to reliably classify them. For example, in the Golub & Slonim et al. data set [Go99], sample 66 is consistently misclassified by most algorithms and is strongly suspected to have been misclassified by the pathologist [Ta02]. A similar situation occurred in our analysis of pediatric tumor data from Children's Memorial Hospital in Chicago [Br03], where a single instance of Medulloblastoma is consistently misclassified by all algorithms and is suspected to be mislabeled.
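The class-permutation check described above can be sketched like this (our simplified illustration of the general randomization approach, not the SAM method itself):

```python
import random

def permutation_p_value(values, labels, n_perm=2000, seed=0):
    """Estimate how often a random relabeling yields a mean difference
    at least as large as the observed one."""
    rng = random.Random(seed)

    def mean_diff(labs):
        g1 = [v for v, l in zip(values, labs) if l == 1]
        g0 = [v for v, l in zip(values, labs) if l == 0]
        return abs(sum(g1) / len(g1) - sum(g0) / len(g0))

    observed = mean_diff(labels)
    hits = 0
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)      # permute the class column
        if mean_diff(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)   # add-one to avoid a zero p-value

# A gene strongly up-regulated in class 1 gets a small p-value
expr = [310, 290, 330, 300, 120, 140, 110, 130]
cls = [1, 1, 1, 1, 0, 0, 0, 0]
print(permutation_p_value(expr, cls))
```

For a full gene-selection run this would be applied to every gene, with a multiple-testing correction on the resulting p-values.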


Hastie et al. [Ha02a] emphasize the difficulty "to ascertain the validity of inferences drawn from the output of most unsupervised learning algorithms." They argue that heuristic arguments are then presented not only for motivating the algorithms but also for judging the quality and strength of the results. Effectiveness then becomes a matter of opinion, cannot be verified directly, and encourages the proliferation of the selected techniques. Finally, even if the true biological process were known, there is a natural biological variability which makes gene expression vary by an unknown amount.

2.5 Many Processing Steps

Microarray data is the result of many complex steps, including microarray probe creation and placement, sample preparation, hybridization, image analysis, data analysis, normalization, and more, each of which may introduce its own noise of unknown amount. For example, the Affymetrix model for background image noise involves averaging over all background cells, inter-chip and intra-chip noise, and biological variation. Each step has its own noise model, some quite complicated. The image files are then processed by different software tools, such as Affymetrix Microarray Suite (MAS) versions 4 or 5 from Affymetrix, Probe Profiler from Corimbia, or dChip from Harvard, each of which can produce different expression levels for a given gene. MAS 4 processing takes the average difference of 20 perfect-match and mismatch expression levels, which can frequently result in negative numbers. MAS 5 processing uses a more complex formula and does not yield negative numbers. Other approaches, such as [Ir03], use only the data from perfect matches. As a result, the great complexity of the microarray data processing pipeline can introduce errors that affect the results.
In the CAMDA’02 competition [Ca02], a significant part of the microarray data that was used in the competition was discovered by one of the participants to be mislabeled [St02], invalidating most results published earlier based on that data.

2.6 Is There a Best Classification Method?
Classification uses predictive modeling techniques on data already labeled with pre-existing classes to build a model that can classify additional data. Popular classification methods include decision trees, neural networks, logistic regression, nearest neighbor, naïve Bayes classifiers, Bayesian networks, and Support Vector Machines [Fa96, Ha00a, Ha00b]. Additional classifiers have been developed specifically for microarray data, including "weighted votes" [Go99], "shrunken centroids" [Ti02], and DLDA (diagonal linear discriminant analysis) [Du02].

Theoretical results in machine learning show that there is no classification algorithm which will produce the lowest error for all kinds of data [Sc94, Wo96]; this has been confirmed by many of the empirical studies cited earlier. John Quackenbush [Qu01] writes: "Indeed, it is becoming increasingly clear that there might never be a 'best' approach and that the application of various techniques will allow different aspects of the data to be explored."

Minimizing classification error, while important, is only one aspect of algorithm selection. Other factors include how explainable the results are, how different error costs enter the analysis, as well as training and scoring times (although the latter tend to be quite small for most microarray data). Thus, while we cannot find a universally optimal algorithm, we may be able to find a very good algorithm for a particular dataset or type of dataset. Both theory and practice suggest that different methods are more appropriate for data with certain characteristics. For example, decision trees are good for domains like chess, with low levels of noise and features that contribute to the output "sequentially." Neural networks are better for domains that are relatively noisy, with many features that contribute to the output in parallel. For microarray data, we need algorithms that are especially robust to noise and able to work with a small number of cases relative to the number of genes. Because of the lack of absolute ground truth and the many sources of error, we need synthetic datasets where the true answer is known, to determine which algorithms are superior and under which conditions.

2.7 What is the Best Clustering?

Clustering determines a domain-dependent, natural, and useful partitioning of a data set into a number of clusters [An73, Sp80]. Clustering partitions the data so that the pair-wise distances (similarities) between records assigned to the same cluster are smaller than the distances to records in other clusters. The distance or similarity measure is critical and can often alter the cluster structure. Clustering has its origins in statistics, and much computational and theoretical work is still based there [Sc64, En69, Ha75]. Clustering is one of the key analytic steps in almost all data analysis and often requires a number of decisions (parameter selections). Consider, for example, the NCI tool for producing a cluster image map [Sc00, Ni01]. When not using the defaults, the user may select a centering technique (2 choices), whether to randomize (2 choices), the cluster method (3 choices), and the cluster distance metric (6 for each axis).

This yields a total of 2 x 2 x 3 x 6 x 6 = 432 different computations, each with additional variable parameters (the proportion to remove and which subset of the cells to use), and each with a large number of possible displays (windows, trees, colors, scales, and data ranges). This is staggering and clearly emphasizes the number of possibilities for one algorithm: the evaluation of a single algorithm may require evaluation under a large number of variables, each with a possibly large range of values.

Many different approaches have been applied to clustering microarray data [Ti99], but two of the most popular are Self-Organizing Maps (SOM) [Ko82, Ko95, Go99] and hierarchical clustering [Da84, MS80, Sc96]. In hierarchical clustering a tree-like structure (or forest, a collection of trees) is developed; the result is a partitioning of the data into parent/child relationships (dependencies) based on some measure of proximity. In a SOM the data is segmented into subsets of similar records, with similarity based on a domain-dependent metric. While for classification it is at least clear when a result is correct, for clustering there is rarely an absolute ground truth, and it is frequently difficult to say when one clustering algorithm is better than another. Thus, to rate the performance of a clustering algorithm, we can perform further


biological tests [Ko02] or, as we argue in this proposal, measure the algorithm under a variety of realistic constraints.

2.8 Visualization

We will generate a number of supporting visualizations focused on how different algorithms handle the individual records (genes). This will be a gene-centric representation of the data, using a variety of visual attributes to display how each gene is clustered by differing algorithms. Figure 3 uses RadViz [Fa01], a radial visualization we invented that can handle a large number of variables; it represents the cosine correlation of the genes under two clustering techniques.

Figure 3. AnVil's (black circles) and Scherf et al.'s (colored circles, fixed size) clustering results co-displayed on top of the 9 cancer cell types. Here we can see that the Re9_CO cluster is fairly stable under both AnVil's and Scherf's clusterings, as is Re7_ME with 1 error, whereas Re3_LC is quite variable.

It is clear that such gene-centric visualizations help in identifying (and perhaps later in understanding) gene stability under differing clustering techniques. The synthetic data sets we propose to generate, together with the above visualizations, will allow us to focus on how and which genes move under differing clustering techniques. Since ground truth will be known within these synthetic data sets, we will be able to see how and where different algorithms affect the grouping.
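The kind of ground-truth comparison the synthetic data sets enable can be illustrated with a toy single-linkage clusterer (ours, for exposition; real experiments would use SOM or hierarchical clustering packages):

```python
def single_linkage(points, k):
    """Naive agglomerative clustering with single linkage:
    repeatedly merge the two closest clusters until k remain."""
    clusters = [[i] for i in range(len(points))]

    def dist(a, b):
        return min(abs(points[i] - points[j]) for i in a for j in b)

    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

# Two planted groups of expression values; ground truth is known by construction
data = [1.0, 1.2, 0.9, 1.1, 5.0, 5.3, 4.8, 5.1]
found = single_linkage(data, 2)
print(sorted(sorted(c) for c in found))  # recovers [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Because the true partition is planted, the output can be scored directly against it, which is exactly what real data does not allow.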

3. Generating Artificial Data Sets
The shapes of most continuous distributions can be sufficiently summarized by their first four moments [NIST]. Put another way, if one fits to a histogram of observed data a distribution that has the same mean (first moment), variance (second moment), skewness (third moment), and kurtosis (fourth moment) as the observed data, then one can usually approximate the overall shape of the distribution very well. We plan to analyze the distributions of these four moments in existing microarray data, both Affymetrix and cDNA, and come up with approximate distributions that model the actual data.

Synthetic datasets have been widely used in past research. For example, Agrawal and Srikant [AS94] implemented a synthetic data generator that simulates transactions in a supermarket. ARminer [Cr01] also includes a facility to generate synthetic transactions.


Keim, Bergeron, and Pickett created test data sets for evaluating data visualization techniques [Ke94]. Gabor Melli [DG99] created a dataset generator with user constraints for testing machine learning algorithms. DARPA generated datasets with simulated Russian nuclear materials smuggling and Russian contract murders for its EELD (Evidence Extraction and Link Discovery) program [Mo02]. There are dataset generators for purposes as varied as XML data [Ab01] and outbred pedigrees in plant genomics [Sh95]. An R package provides a sagmbSimulateData method for simple simulation of microarray data for variance stabilization studies. These varied efforts highlight that successful synthetic datasets need to closely model the key features of each domain.

Much work is under way on standards for the creation of clean data sets for gene prediction in comparative studies. Much work is also under way on the construction of gene networks from temporal models of gene expression, the collection of such data being difficult. The development of a parameterized, controlled suite of data sets will support both activities: the first by identifying the key distributional and noise parameters that can approximate such data sets, the second by generating synthetic data sets approximating long time-course data (future work for this project). Some researchers have developed simulated data sets, but only for specific studies and not in as broad a sense as we are proposing. Pollard and van der Laan [PL02] generated synthetic data to compare the performance of their Mean Split Silhouette technique with others. Several synthetic microarray datasets were generated by Yeung et al. [Ye01] to investigate model-based clustering.
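The four-moment summary discussed at the start of this section can be computed directly (a minimal sketch using population-style estimators; the function name is ours):

```python
import math

def four_moments(xs):
    """Mean, variance, skewness, and excess kurtosis of a sample."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    sd = math.sqrt(var)
    skew = sum(((x - mean) / sd) ** 3 for x in xs) / n
    kurt = sum(((x - mean) / sd) ** 4 for x in xs) / n - 3  # 0 for a normal
    return mean, var, skew, kurt

# A symmetric sample has skewness ~0; its flat shape gives negative excess kurtosis
print(four_moments([1, 2, 3, 4, 5]))
```

Matching these four numbers between real and synthetic data is the acceptance criterion we have in mind for each generated distribution.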
The novelty of our proposal is a comprehensive approach to generating synthetic but realistic microarray data sets, where the ground truth is known and the level of noise and other parameters can be controlled, all while approximating real data sets.

3.1 What is the Right Level to Synthesize Microarray Data?

Before analysis, microarray data goes through several transformation steps. First, the visual image is scanned (see Figure 1) and raw intensity levels are recorded. For the Affymetrix platform, these are stored in CEL files, which have a text format and contain summarized probe-level (PM, MM) data. The CEL file is then processed by one of a variety of software algorithms that aggregate the probe-level data into a single measurement, resulting in one number per gene. A similar process exists for cDNA data.

Generating raw visual images would divert the focus of the project into image recognition issues. Generating realistic probe-level data would be much more complex than generating data at the gene level, due to the need to model the unknown inter-probe relationships. We therefore propose to generate data at the gene level, which is also the type of data usually input to microarray classification algorithms. As long as the generator is consistent in giving higher values to genes up-regulated for a particular class, and lower values to down-regulated genes, the synthetic microarray data will provide a realistic test for classification and clustering algorithms.


3.2 Generating Simulated Microarray Datasets

Microarray datasets are different from supermarket transactions, and existing generators are not sufficient: they fail to model the specific distributions of different genes and the correlations between means and standard deviations, and they do not provide the other parameters needed to simulate realistic microarrays. The approach we propose is to first examine the distribution of key parameters in microarray data and then generate a suite of artificial datasets with similar parameters. In the following sections, we report results from initial experiments on two datasets: the Golub et al. [Go99] Leukemia ALL/AML training data set (7070 genes, 38 samples, 2 classes) and the Michigan Lung Cancer data [Be02] (7070 genes, 96 samples, 3 classes). While very preliminary, the initial results are promising.

3.3 Distribution of Gene Means and Standard Deviations

We first examined the distribution of gene means over all genes. While the raw means do not show a normal distribution, we found that the log means do. Some genes had a negative mean, which we transformed using nlog10(X) = -log10(-X) (Figure 4). In both datasets we see a large peak on the right for positive means and a small peak on the left for negative means.
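The nlog10 transform is easy to state in code (a minimal sketch; the sample values come from the Figure 1 table):

```python
import math

def nlog10(x):
    """Signed log transform used for Figure 4: log10(x) for positive values,
    -log10(-x) for negative values (undefined at zero, excluded here)."""
    if x > 0:
        return math.log10(x)
    if x < 0:
        return -math.log10(-x)
    raise ValueError("nlog10 undefined at 0")

means = [193, -70, 144, 33, 318, 1764, 1537]  # raw values from the sample table
print([round(nlog10(m), 2) for m in means])
```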

[Histograms: Golub ALL/AML train data (left); Michigan Lung Cancer data, 96 samples (right). X-axis: nlog10 mean; Y-axis: Frequency.]
Figure 4: Histograms of log10 means of raw gene expressions (negative means transformed using nlog10(X) = -log10(-X)).

Genes with negative means correspond to probes where there are more mismatches than matches and are generally not used as predictive markers. We also excluded from the following analysis genes with mean < 1; this removed the negative logs, and there were only a few such genes. We then computed statistics of log10(mean); see Table 1.

Table 1: Statistics of log10(means)

                           Golub     Michigan
Genes w. Mean >= 1         74.2%     92.7%
Mean of log10(mean)        2.437     2.287
St. Dev of log10(mean)     0.659     0.669
Skew of log10(mean)        0.042     -0.021
Kurtosis of log10(mean)    0.600     0.336


We then normalized the distribution of log10(means) by subtracting the overall mean and dividing by the overall standard deviation. While the Shapiro-Wilk test for normality [Ro92, St86] gives a low p-value, because of the lack of normal behavior at the extremes of the distribution, the plots in Figure 5 show a remarkably good agreement between the log10(means) distribution and the normal distribution.
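We use the Shapiro-Wilk test in our analysis; as a dependency-free sketch of the same idea — standardizing a sample and measuring its distance from the normal distribution — one can compute a Kolmogorov-Smirnov-style statistic (function names are ours):

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ks_statistic_vs_normal(xs):
    """Standardize the sample and return the Kolmogorov-Smirnov distance
    to the standard normal CDF (smaller = closer to normal)."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    zs = sorted((x - mean) / sd for x in xs)
    return max(max(abs((i + 1) / n - normal_cdf(z)),
                   abs(i / n - normal_cdf(z)))
               for i, z in enumerate(zs))

# A roughly bell-shaped sample sits much closer to normal than a skewed one
bell = [-2, -1, -1, 0, 0, 0, 0, 1, 1, 2]
skewed = [1, 1, 1, 1, 1, 2, 2, 3, 7, 20]
print(ks_statistic_vs_normal(bell) < ks_statistic_vs_normal(skewed))  # True
```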

Figure 5: Plot of log10(means) vs. the normal distribution. Note how close the actual log10(means) (blue) are to the normal distribution (red line).

3.4 Relationship between Gene Means and Standard Deviations

Next, we examined the relationship between the means and the standard deviations, which is easiest to see on a log-log plot (Figure 6). For both datasets, there is a strong linear correlation between the log means and log standard deviations. We also see that the standard deviation tends to increase slightly faster for higher means, so a better fit could be obtained with a non-linear model.

Figure 6: log means vs. log standard deviations for the Golub and Michigan data, with the black line showing the linear regression fit.

Simple linear models give the following results. For the Golub data:

  log10(SD) = 0.600 log10(Means) + 0.895; adj. R-squared: 0.698; residual standard error: 0.260


and for the Michigan data:

  log10(SD) = 0.651 log10(Means) + 0.644; adj. R-squared: 0.761; residual standard error: 0.244

3.5 Distribution of Individual Genes

Our initial analysis of the distributions of individual genes shows that a large percentage have a distribution close to normal. For example, in the Golub data set about 43% of genes have Shapiro-Wilk p-values > 0.1. The fit improves if the data is separated by class. For example, in the Golub dataset one of the strongest genetic markers is Zyxin (X95735 probe). If we analyze Zyxin over all cases (Figure 7a), we see that it is far from normal (Shapiro-Wilk p < 0.0001). However, Zyxin is strongly up-regulated for AML. If we plot Zyxin values only for AML (Figure 7b), the distribution is much closer to normal (Shapiro-Wilk p = 0.22).
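The mean-standard-deviation regression of Section 3.4 can be reproduced with an ordinary least-squares fit on the log-log scale (a sketch on made-up (mean, SD) pairs, not the actual Golub or Michigan data):

```python
import math

def fit_log_sd_model(means, sds):
    """Least-squares fit of log10(SD) = a * log10(Mean) + b,
    mirroring the linear models reported in Section 3.4."""
    xs = [math.log10(m) for m in means]
    ys = [math.log10(s) for s in sds]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

# Toy gene-level (mean, SD) pairs with SD growing with the mean
means = [10, 100, 1000, 10000]
sds = [8, 60, 500, 4000]
a, b = fit_log_sd_model(means, sds)
print(round(a, 3), round(b, 3))
```

In the generator, the fitted (a, b) pair is all that is needed to assign a realistic standard deviation to each synthetic gene mean.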

[Histograms omitted. Figure 7a: Zyxin for ALL and AML; Figure 7b: Zyxin for AML only.]

3.6 Generating Artificial Microarray Data

To test our initial findings, we generated artificial gene expression data with 6000 genes and 30 samples, equally split between 2 classes, High and Low. We first generated a mean for each gene using a log-normal distribution modeled after the Golub data. For each gene we set the standard deviation using the regression formula derived from the Golub data. We then used these means and standard deviations to generate the values for each gene by drawing from a normal distribution. The first 10 genes were the "true" genes, whose mean expression value for the High class was up-regulated by the factor UP. We ranked the genes by the T-value of the mean difference and recorded the actual rank of each of the 10 "true" genes. We repeated this experiment 10 times and summarize the results in Table 2.

Table 2: Rank average and standard deviation of the true genes, ranked by T-value, for 6000 genes and 30 samples, with varying UP-regulation factor
UP    Statistic   True gene:  1     2     3      4      5      6      7      8       9       10
3     Rank Avg                1     2     3      4      5      6.4    9.7    25.4    242.5   986.5
3     Rank SD                 0     0     0      0      0      1.3    6.6    38.3    486.4   1411.2
2     Rank Avg                1     2     5.7    8.5    34.7   71.5   113.3  247.5   774.3   2288.9
2     Rank SD                 0     0     7.5    10.7   46.3   89.3   119.4  211.0   724.9   1018.2
1.5   Rank Avg                1.1   7.3   82.6   146.9  310.9  496.9  878.4  1701.7  2656.5  4119.5
1.5   Rank SD                 0.3   8.7   127.7  187.2  312.7  441.4  707.7  763.2   956.8   1502.5


On average, for UP=3, 7 of the top 10 genes are the true genes, while for UP=2 only 4 of the top 10 are true, and for UP=1.5 only 2 of the top genes are true. The results of this experiment, and of another experiment where we kept UP=2 but varied the number of microarray samples (MA) from 10 to 40, are plotted in Figure 8. Both again highlight the validity and potential utility of our proposed synthetic generator approach.
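As a quick check of the counts quoted above, we can read the UP=2 average ranks off Table 2 and count how many fall inside the top 10:

```python
# Average ranks of the ten "true" genes for UP = 2, read off Table 2
up2_rank_avg = [1, 2, 5.7, 8.5, 34.7, 71.5, 113.3, 247.5, 774.3, 2288.9]

# Count how many true genes land, on average, inside the top 10
in_top10 = sum(1 for r in up2_rank_avg if r <= 10)  # -> 4, as quoted in the text
```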
Figure 8: True rank vs. average rank in the simulated data. One panel varies the UP-regulation factor (UP = 3, 2, 1.5); the other varies the number of microarray samples (MA = 10, 20, 30, 40) with UP = 2.
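The generation-and-ranking procedure of section 3.6 can be sketched in a few lines. This is a minimal illustration, not the paper's code: the log-normal parameters for the gene means (5.0, 1.0 on the natural-log scale) are placeholders, since the values fitted to the Golub data are not given here, and genes are ranked by absolute T-value.

```python
import math
import random
import statistics

random.seed(0)

N_GENES, N_SAMPLES, N_TRUE, UP = 6000, 30, 10, 2.0
SLOPE, INTERCEPT = 0.600, 0.895   # log10(SD) vs log10(mean) fit from the Golub data
half = N_SAMPLES // 2             # 15 High and 15 Low samples

def t_value(a, b):
    """Two-sample T statistic for the difference of class means."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va / len(a) + vb / len(b))

scores = []
for g in range(N_GENES):
    base_mean = random.lognormvariate(5.0, 1.0)   # illustrative log-normal parameters
    sd = 10 ** (SLOPE * math.log10(base_mean) + INTERCEPT)
    up = UP if g < N_TRUE else 1.0                # first N_TRUE genes are the "true" markers
    high = [random.gauss(base_mean * up, sd) for _ in range(half)]
    low = [random.gauss(base_mean, sd) for _ in range(half)]
    scores.append((abs(t_value(high, low)), g))

# Rank all genes by |T| and record where the true genes landed (rank 1 = best)
ranking = [g for _, g in sorted(scores, reverse=True)]
true_ranks = sorted(ranking.index(g) + 1 for g in range(N_TRUE))
```

Repeating this loop over several random seeds and averaging `true_ranks` reproduces the kind of summary shown in Table 2.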

4. Detailed Work Plan
We plan first to analyze publicly available gene expression datasets from Stanford [SMD], MIT [MIT], and other sources and determine the types of distributions, their moments, and key parameters. These include the distributions of gene expression means, standard deviations, skewness, kurtosis, and fit to normal and other distributions, both overall and with respect to sample classes. This will allow us to select the range of settings for creating realistic synthetic datasets. For example, can genes uncorrelated with the class be modeled with a normal distribution? What is the effect on the data of preprocessing with different software, such as MAS 4, MAS 5, or Probe Profiler? Is a log-normal distribution a good fit to the distribution of mean values for microarrays? A report and paper will then be generated with the results and an analysis.

Software Development Steps
We propose two development stages: one to accelerate the development and release of the suite by working within R [CRAN], and the other to provide improved computational performance and a more interactive environment by working within the lab's Universal Visualization Platform (UVP).

1. We will initially prototype the synthetic data set generator in R and generate a documented package for the Affymetrix test suite. R has already been used extensively for bioinformatics and microarray data analysis, and has many useful statistical functions built in. The software and documentation will then be released.

2. We will build a number of prototype highly interactive visualizations and algorithms within the UVP [Ag04]. That platform is available in Java, is the basis for visualization and data mining courses, and already contains a number of bioinformatics algorithms. The resulting software and documentation will be released as well.


The artificial dataset generator will have a number of user-controlled key parameters, including:

- Number of genes
- Number of samples
- Number of classes
- Data distribution for the overall mean, standard deviation, etc. (e.g. Zipf with parameters, or an example distribution given as a histogram)
- Data distribution for individual genes uncorrelated with the class (normal, mixture, or example distribution)
- Data measurement error function (percentage noise, absolute noise, noise distribution normality). The noise model will reflect the known effect that noise as a percentage is much higher at lower levels of measurement [Qu01].
- Number of up- and down-regulated genes per class. The true class will be reflected in specific genes being up- and down-regulated.
- Data distribution for genes correlated with the class
- Class measurement error function. This is independent of the true class and indicates whether the true class was measured correctly.

The user will specify the key parameters (with defaults available) and the system will generate the desired datasets, using the approach described in section 3.6. We will use the generator to create a number of stress-test datasets, varying key parameters such as the number of samples, the level of data measurement noise, or the class measurement noise. These datasets will be used to determine the performance of existing algorithms under different conditions.

We will also provide a mechanism to build look-alike datasets. The user will provide a single existing dataset, and the system will generate synthetic datasets that match the overall distribution of the key parameters in the given dataset. As a large number of such datasets can be generated, this will give the user a mechanism to evaluate the stability of any algorithm over different but statistically equivalent datasets (equivalent with respect to the selected key parameters).
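As a concrete illustration of how these user-controlled parameters might be bundled, the sketch below collects them in a Python dataclass. All field names and defaults are hypothetical placeholders, not part of any released interface:

```python
from dataclasses import dataclass

@dataclass
class GeneratorParams:
    """User-controlled parameters of the synthetic microarray generator.
    Names and defaults here are illustrative only."""
    n_genes: int = 6000
    n_samples: int = 30
    n_classes: int = 2
    mean_distribution: str = "lognormal"   # distribution of overall gene means
    noise_percent: float = 0.10            # measurement noise as a fraction of signal
    n_up_regulated: int = 10               # up-regulated marker genes per class
    n_down_regulated: int = 0              # down-regulated marker genes per class
    up_factor: float = 2.0                 # fold change applied to marker genes
    class_error_rate: float = 0.0          # chance a sample's class label is wrong

# A stress-test configuration: more samples, stronger up-regulation
params = GeneratorParams(n_samples=40, up_factor=3.0)
```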
We will test our approach with a variety of algorithms, including our own, publicly available algorithms such as those in Weka [WE99] and R [R], and commercial ones. Our results will be submitted to appropriate journals and conferences.

5. Summary
A key problem in microarray data analysis is that the number of features (genes) is typically very large (many thousands), while the number of records (samples) is typically very small, usually less than 100. Analyzing such "ultra-wide" and "short" datasets is statistically likely to generate false positives. In addition, the true biological processes are usually unknown, and the data contain noise and errors from many sources, including occasional mislabeled samples. Many algorithms have been used (e.g., SOM, KNN, SVM, BN, NN), but it is hard to judge which algorithm is actually better on what type of data. To deal with these issues, we propose to create a generator of synthetic but realistic microarray datasets and use it to build a comprehensive test suite of parameterized microarray datasets for testing various algorithms for gene selection, classification, and clustering, and to find out which methods perform best under which conditions. Our initial analysis shows promising results for our ability to approximate the overall gene expression distributions. Our initial focus will be on classification and clustering tasks and on Affymetrix-type data.

Cooperation with molecular biologists is critical for this effort. We have recently worked with experts in microarrays and molecular biology from the MIT/Whitehead Institute [PR02, PK03, PT03a], and plan to cooperate with them on this project as well. We have also discussed this project with researchers at Stanford and Berkeley, who expressed interest in cooperating on it.

6. References
[Ab01] Aboulnaga, A., J. Naughton, C. Zhang, Generating Synthetic Complex-structured XML Data, http://www.cs.wisc.edu/niagara/papers/webdb01xmldatagen.pdf
[Ag04] Gee, A., A Universal Visualization Platform, Dissertation, 2004, University of Massachusetts at Lowell. Also in A. Gee, L. Li, M. Yu, M.B. Smrtic, U. Cvek, H. Goodell, V. Gupta, C. Lawrence, J. Zhou, C-H. Chiang and G. Grinstein (2005), Universal Visualization Platform, Proceedings of the Visualization and Data Analysis SPIE-IS&T Electronic Imaging Conference, San Jose, California, January 17-18, 2005, SPIE Vol. 5669, pp. 274-283.
[An73] Anderberg, M.R. (1973), Cluster Analysis for Applications, New York: Academic Press.
[AS94] Agrawal, R., and R. Srikant, "Fast Algorithms for Mining Association Rules", Proc. 20th Int. Conf. Very Large Data Bases, VLDB 1994.
[Be02] Beer, D., et al., Gene-expression profiles predict survival of patients with lung adenocarcinoma, Nature Medicine 9 (816), 2002.
[Bio] Bioconductor project, http://www.bioconductor.org/
[Br03] Bremer, E., personal communication, 2003.
[Ca00] CAMDA 2000, Proceedings of the Critical Assessment of Microarrays Conference, Duke University, 2000, www.camda.duke.edu/camda00
[Ca02] CAMDA 2002, Proceedings of the Critical Assessment of Microarrays Conference, Duke University, 2002, www.camda.duke.edu/camda02
[Ca03] CAMDA 2003, Proceedings of the Critical Assessment of Microarrays Conference, Duke University, 2003, www.camda.duke.edu/camda03
[Ch83] Chang, "On Using Principal Components before Separating a Mixture of Two Multivariate Normal Distributions", Applied Statistics, 32, 267-275, 1983.
[Ch02] Cheng, J., et al., KDD Cup 2001 Report, SIGKDD Explorations, Volume 3, Issue 2, January 2002, http://www.acm.org/sigkdd/explorations/issue3-2.htm
[CRAN] CRAN, http://cran.r-project.org/


[Cr01] Cristofor, L., ARMiner Project, http://www.cs.umb.edu/~laur/ARMiner/
[Da84] Day, W.H., and H. Edelsbrunner (1984), Efficient algorithms for agglomerative hierarchical clustering methods, Journal of Classification, 1(1):7-24.
[DG99] Melli, G., Dataset Generator, www.datasetgenerator.com
[Du00] Dubitzky, W., et al., Symbolic and Subsymbolic Machine Learning Approaches for Molecular Classification of Cancer and Ranking of Genes, in Proceedings of CAMDA 2000, Duke University, 2000.
[Du02] Dudoit, S., J. Fridlyand, and T.P. Speed (2002), Comparison of discrimination methods for the classification of tumors using gene expression data, Journal of the American Statistical Association, Vol. 97, No. 457, pp. 77-87 (Tech report #576).
[EELD] EELD link discovery datasets, http://www.darpa.mil/iao/EELD.htm
[En69] Englemann, L., and J.A. Hartigan (1969), Percentage Points of a Test for Clusters, Journal of the American Statistical Association, 64, 1647-1648.
[Fa96] Fayyad, U., G. Piatetsky-Shapiro, and P. Smyth, From Data Mining to Knowledge Discovery in Databases, AI Magazine 17(3), Fall 1996, 37-54.
[Fa01] Fayyad, U., G. Grinstein, and A. Wierse (2001), Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann Publishers.
[Go99] Golub, T., et al., Science (v. 286), Oct 1999.
[Gr01] Grinstein, G., B. Jessee, P. Hoffman, A. Gee, and P. O'Neil (2001), High Dimensional Visualization Support for Data Mining Gene Expression Data, in DNA Arrays: Technologies and Experimental Strategies, CRC Press, Florida.
[Ha00a] Han, J., and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
[Ha00b] Hand, D., H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, 2000.
[Ha02] Hartemink, A., et al., Combining Location and Expression Data for Principled Discovery of Genetic Regulatory Network Models, PSB 2002, psb.stanford.edu/psb-online
[Ha75] Hartigan, J.A. (1975), Clustering Algorithms, New York: John Wiley & Sons.
[Ha02a] Hastie, T., R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer, 2001.
[Id01] Ideker, T., et al., Science 292 (May 4, 2001), 929-934.
[Ir03] Irizarry, R., et al. (2003), Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data, Biostatistics, 2003.

[Ka02] Kallioniemi, A., Molecular signatures of breast cancer - predicting the future, N Engl J Med, Vol. 347, No. 25, December 19, 2002 (www.nejm.org).
[Ke94] Keim, D., R.D. Bergeron, R.M. Pickett, Test Data Sets for Evaluating Data Visualization Techniques, in Perceptual Issues in Visualization, Springer, 1994.
[Ko02] Kohane, I., Kho, and Butte, Microarrays for an Integrative Genomics, A Bradford Book, MIT Press, 2003, ISBN 0-262-11271-X.
[Ko82] Kohonen, T., "Self-organized formation of topologically correct feature maps," Biological Cybernetics, vol. 43, pp. 59-69, 1982.
[Ko95] Kohonen, T. (1995), Self-Organizing Maps, Springer-Verlag, Berlin.
[MIT] MIT Whitehead Institute Cancer Genomics Publications Data Sets, http://www-genome.wi.mit.edu/cgi-bin/cancer/datasets.cgi
[Mo02] Mooney, R., P. Melville, L.R. Tang, Relational Data Mining with Inductive Logic Programming for Link Discovery, Proceedings of the National Science Foundation Workshop on Next Generation Data Mining, Nov. 2002, Baltimore, MD.
[MS80] Mizoguchi, R., and M. Shimura (1980), A Nonparametric Algorithm for Detecting Clusters Using Hierarchical Structure, IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-2, 292-300.
[NI00] NIH, Genomics and Bioinformatics Group, Cluster Image Map Software, http://discover.nci.nih.gov/nature2000/tools/cluster.html
[Ni01] National Institutes of Health, Genomics and Bioinformatics Group, Cluster Image Map Software, http://discover.nci.nih.gov/nature2000/tools/cluster.html
[NIST] NIST/SEMATECH e-Handbook of Statistical Methods, http://www.itl.nist.gov/div898/handbook/
[NR02] Nguyen, D.V., and D.M. Rocke, Multi-class cancer classification via partial least squares with gene expression profiles, Bioinformatics 2002 Sep;18(9):1216-26 (in PubMed).
[Po02] Pomeroy, S., et al. (2002), Nature (415), Jan 2002.
[PL02] Pollard, K.S., and M.J. van der Laan, A Method to identify significant clusters in gene expression data, in SCI2002 Proceedings, volume II, pages 318-325, International Institute of Informatics and Systemics, 2002.


[PR02] Piatetsky-Shapiro, G., and S. Ramaswamy, "Analyzing Microarray Gene Expression Data", SPSS webcast, June 26, 2002, www.kdnuggets.com/gpspubs/microarray/
[PK03] Piatetsky-Shapiro, G., T. Khabaza, S. Ramaswamy, Capturing Best Practices for Microarray Data Analysis, Proceedings of ACM KDD-2003 Conference, ACM Press, 2003.
[PT03a] Piatetsky-Shapiro, G., and P. Tamayo, Microarray Data Mining: Facing the Challenges, SIGKDD Explorations 5(2), Special Issue on Microarray Data Mining, Dec 2003.
[PT03b] Piatetsky-Shapiro, G., and P. Tamayo, Editors, SIGKDD Explorations 5(2), Special Issue on Microarray Data Mining, Dec 2003.
[Qu01] Quackenbush, J., Computational Analysis of Microarray Data, Nature Reviews Genetics, June 2001, Vol. 2.
[R] The R Project for Statistical Computing, http://www.r-project.org/
[Ra01] Ramaswamy, S., et al., Multiclass cancer diagnosis using tumor gene expression signatures, PNAS, December 18, 2001, vol. 98, no. 26, 15149-15154.
[Ro92] Royston, P., Approximating the Shapiro-Wilk W test for Non-Normality, Statistics and Computing 1992; 2:117-119.
[Sc94] Schaffer, C., A Conservation Law for Generalization Performance, in Proceedings of the 11th International Conference on Machine Learning (1994), 259-265.
[Sc64] Schnell, P. (1964), A Method for Discovering Data-Groups, Biometrica 6, 47-48.
[Sc00] Scherf, U., et al. (2000), A gene expression database for the molecular pharmacology of cancer, Nature Genetics 24, 236-244; and discover.nci.nih.gov/nature2000/tools/cluster.html
[Sc96] Schikuta, E. (1996), Grid clustering: An efficient hierarchical method for very large data sets, Proceedings of the 13th Conference on Pattern Recognition, Vol. 2, IEEE Computer Society Press, pp. 101-105.
[Sh95] Sherman, B., M. Sewell, and D. Neale, "Syd, A Synthetic Data Generator for Three Generation Outbred Pedigrees", Plant Genome IV Conference, San Diego, CA, 1995.
[SMD] SMD: Stanford Microarray Database, genome-www5.stanford.edu/MicroArray/SMD/
[Sp80] Spath, H. (1980), Cluster Analysis Algorithms, Chichester, England: Ellis Horwood.
[St02] Stivers, D., J. Wang, G. Rosner, K. Coombes, Organ-Specific Differences in Gene Expression and UniGene Annotations Describing Source Material, CAMDA 2002 Proceedings, http://www.camda.duke.edu/camda02/Abstracts/Coombes.asp


[St86] Stephens, M.A., and R.B. D'Agostino, Goodness of Fit Techniques, Marcel Dekker, New York, 1986.
[Ta02] Tamayo, P., personal communication, 2002.
[Ti99] Tibshirani, R., et al., "Clustering methods for the analysis of DNA microarray data", Technical report, Department of Statistics, Stanford University, 1999.
[Ti02] Tibshirani, R., et al., "Diagnosis of multiple cancer types by shrunken centroids of gene expression", PNAS 2002, 99:6567-6572 (May 14).
[Tu01] Tusher, Tibshirani, and Chu, Significance analysis of microarrays applied to the ionizing radiation response, PNAS 2001, 98:5116-5121.
[WE99] Witten, I., and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 1999.
[Wo96] Wolpert, D., The lack of a priori distinctions between learning algorithms, Neural Computation, 8(7):1341-1390, 1996.
[YC02] Yoo, C., and G. Cooper, Discovery of gene-regulation pathways using local causal search, Proc AMIA Symposium 2002:914-8.
[Ye01] Yeung, K., et al., "Model-based clustering and data transformations for gene expression data", Bioinformatics 2001, 17:977-987.



