cn.FARMS: a latent variable model to detect copy number variations in microarray data with a low false discovery rate

Cost-effective oligonucleotide genotyping arrays like the Affymetrix SNP 6.0 are still the predominant technique to measure DNA copy number variations (CNVs). However, CNV detection methods for microarrays overestimate both the number and the size of CNV regions and, consequently, suffer from a high false discovery rate (FDR). A high FDR means that many detected CNVs are false positives, which cannot be associated with a disease in a clinical study; correction for multiple testing must nevertheless account for them, which decreases the study's discovery power. To control the FDR, we propose a probabilistic latent variable model, ‘cn.FARMS’, which is optimized by a Bayesian maximum a posteriori approach. cn.FARMS controls the FDR through the information gain of the posterior over the prior. The prior represents the null hypothesis of copy number 2 for all samples, from which the posterior can deviate only through strong and consistent signals in the data. On HapMap data, cn.FARMS clearly outperformed the two most prevalent methods with respect to sensitivity and FDR. The software cn.FARMS is publicly available as an R package at http://www.bioinf.jku.at/software/cnfarms/cnfarms.html.
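The notion of an information gain of the posterior over the prior can be illustrated with a minimal sketch. It assumes, purely for illustration, a univariate standard normal prior and a Gaussian posterior, and measures the gain as the Kullback-Leibler divergence between them. The function names, the threshold value, and the Gaussian form are assumptions of this sketch and are not taken from the cn.FARMS software.

```python
import math


def gaussian_information_gain(post_mean, post_var):
    """KL divergence KL(N(post_mean, post_var) || N(0, 1)) in nats.

    Illustrative assumption: the prior is a standard normal and the
    posterior is Gaussian. The gain is zero when the posterior equals
    the prior and grows as the data pull the posterior away from it.
    """
    if post_var <= 0:
        raise ValueError("posterior variance must be positive")
    return 0.5 * (post_var + post_mean ** 2 - 1.0 - math.log(post_var))


def is_candidate_region(post_mean, post_var, threshold=1.0):
    """Flag a region only if the information gain exceeds a threshold.

    The threshold of 1.0 nat is a hypothetical value chosen for this
    example, not a recommended setting.
    """
    return gaussian_information_gain(post_mean, post_var) > threshold


if __name__ == "__main__":
    # Posterior identical to the prior: no evidence against the null.
    print(gaussian_information_gain(0.0, 1.0))
    # Weak, noisy signal: posterior barely moves, region is not flagged.
    print(is_candidate_region(0.1, 0.9))
    # Strong, consistent signal: shifted and sharpened posterior, flagged.
    print(is_candidate_region(2.0, 0.1))
```

In this sketch, a region is reported only when the data move the posterior far enough from the null-hypothesis prior, which is the sense in which a strong and consistent signal is required before a deviation from copy number 2 is called.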
