Are clusters found in one dataset present in another dataset?

In many microarray studies, a cluster defined on one dataset is sought in an independent dataset. If the cluster is found in the new dataset, the cluster is said to be "reproducible" and may be biologically significant. Classifying a new datum to a previously defined cluster can be seen as predicting which of the previously defined clusters is most similar to the new datum. If the new data classified to a cluster are similar, molecularly or clinically, to the data already present in the cluster, then the cluster is reproducible and the corresponding prediction accuracy is high. Here, we take advantage of the connection between reproducibility and prediction accuracy to develop a validation procedure for clusters found in datasets independent of the one in which they were characterized. We define a cluster quality measure called the "in-group proportion" (IGP) and introduce a general procedure for individually validating clusters. Using simulations and real breast cancer datasets, the IGP is compared to four other popular cluster quality measures (homogeneity score, separation score, silhouette width, and weighted average discrepant pairs score). Moreover, simulations and the real breast cancer datasets are used to compare the four versions of the validation procedure which all use the IGP, but differ in the way in which the null distributions are generated. We find that the IGP is the best measure of prediction accuracy, and one version of the validation procedure is the more widely applicable than the other three. An implementation of this algorithm is in a package called "clusterRepro" available through The Comprehensive R Archive Network (http://cran.r-project.org).

[1]  Karni Ginzburg,et al.  Stages of Breast Cancer , 2000 .

[2]  Susmita Datta,et al.  Comparisons and validation of statistical clustering techniques for microarray gene expression data , 2003, Bioinform..

[3]  Michael Q. Zhang,et al.  Evaluation and comparison of clustering algorithms in analyzing es cell gene expression data , 2002 .

[4]  V R Savran,et al.  [Stages of breast cancer]. , 2006, Likars'ka sprava.

[5]  Richard C. Dubes,et al.  Cluster validity profiles , 1982, Pattern Recognit..

[6]  M K Kerr,et al.  Bootstrapping cluster analysis: Assessing the reliability of conclusions from microarray experiments , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[7]  Yudong D. He,et al.  Gene expression profiling predicts clinical outcome of breast cancer , 2002, Nature.

[8]  Christian A. Rees,et al.  Molecular portraits of human breast tumours , 2000, Nature.

[9]  Robert Tibshirani,et al.  Cluster Validation by Prediction Strength , 2005 .

[10]  J. William Ahwood,et al.  CLASSIFICATION , 1931, Foundations of Familiar Language.

[11]  Ka Yee Yeung,et al.  Validating clustering for gene expression data , 2001, Bioinform..

[12]  R. Tibshirani,et al.  Repeated observation of breast tumor subtypes in independent gene expression data sets , 2003, Proceedings of the National Academy of Sciences of the United States of America.

[13]  S. Dudoit,et al.  A prediction-based resampling method for estimating the number of clusters in a dataset , 2002, Genome Biology.

[14]  S. Dudoit,et al.  Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data , 2002 .

[15]  Eytan Domany,et al.  Resampling Method for Unsupervised Estimation of Cluster Validity , 2001, Neural Computation.

[16]  Ronald Bodley Sir Scott Cancer : the facts , 1979 .

[17]  R. Spang,et al.  Predicting the clinical status of human breast cancer by using gene expression profiles , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[18]  R. Tibshirani,et al.  Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications , 2001, Proceedings of the National Academy of Sciences of the United States of America.