Subtree power analysis and species selection for comparative genomics

Sequence comparison across multiple organisms aids in the detection of regions under selection. However, resource limitations require a prioritization of genomes to be sequenced. This prioritization should be grounded in two considerations: the lineal scope encompassing the biological phenomena of interest, and the optimal species within that scope for detecting functional elements. We introduce a statistical framework for optimal species subset selection, based on maximizing power to detect conserved sites. Analysis of a phylogenetic star topology shows theoretically that the optimal species subset is not in general the most evolutionarily diverged subset. We then demonstrate this finding empirically in a study of vertebrate species. Our results suggest that marsupials are prime sequencing candidates.
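The star-topology claim can be illustrated with a small numerical sketch. This is not the paper's procedure. It assumes a Jukes-Cantor substitution model, k leaves that share one branch length t, and conserved sites that evolve at a fixed fraction `rate` of the neutral rate. It then computes the exact power of a level-alpha randomized Neyman-Pearson test at a single site. The function names `jc_probs`, `pattern_prob`, and `np_power`, and all parameter values, are hypothetical choices made for illustration.

```python
import itertools
import math


def jc_probs(t):
    """Jukes-Cantor probabilities that a leaf equals / differs from the root
    base after a branch of length t (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e, 0.25 - 0.25 * e


def pattern_prob(pattern, t):
    """Probability of one alignment column on a star tree whose leaves all
    have branch length t, summing over the unobserved root base."""
    same, diff = jc_probs(t)
    k = len(pattern)
    total = 0.0
    for root in "ACGT":
        n = pattern.count(root)  # leaves that match this root base
        total += 0.25 * same ** n * diff ** (k - n)
    return total


def np_power(k, t, rate, alpha=0.05):
    """Power of the level-alpha randomized likelihood-ratio test of
    neutral (branch length t) versus conserved (branch length rate * t)."""
    patterns = itertools.product("ACGT", repeat=k)
    rows = [(pattern_prob(p, t), pattern_prob(p, rate * t)) for p in patterns]
    # Reject first on the columns that most favor the conserved model.
    rows.sort(key=lambda r: r[1] / r[0], reverse=True)
    size = power = 0.0
    for p0, p1 in rows:
        if size + p0 <= alpha:
            size += p0
            power += p1
        else:
            # Randomize on the boundary column so the size is exactly alpha.
            power += (alpha - size) / p0 * p1
            break
    return power


# Assumed settings: 4 species, conserved sites at 30% of the neutral rate.
powers = {t: np_power(4, t, 0.3) for t in (0.01, 0.5, 10.0)}
```

Under these assumptions the power at the intermediate branch length exceeds the power at both extremes. Very short branches leave almost every column identical under either model, and very long branches saturate both models toward uniform base frequencies. That is consistent with the statement that the most diverged subset is not in general optimal.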
