Multi-marker tagging single nucleotide polymorphism selection using estimation of distribution algorithms

OBJECTIVES This paper presents an optimization algorithm for the automatic selection of a minimal subset of tagging single nucleotide polymorphisms (SNPs).

METHODS AND MATERIALS Finding the minimal set of tagging SNPs is treated as an optimization problem in which each tagged SNP can be covered either by a single tagging SNP or by a pair of tagging SNPs. The problem is solved with an estimation of distribution algorithm (EDA) that uses the topological structure defined by the SNP correlations to model the interactions between problem variables. The EDA stochastically searches the constrained space of feasible solutions. It is evaluated on HapMap reference panel data sets.

RESULTS The EDA was compared with a SAT solver, which can find minimal single-marker tagging sets, and with the Tagger program. The minimal multi-marker tagging sets found by the EDA contained between 10% and 43% fewer tagging SNPs than the sets found by the other algorithms.

CONCLUSIONS The introduced algorithm is effective at identifying minimal multi-marker SNP sets, which considerably reduce the size of the tagging SNP set compared with single-marker sets. Other variants of the SNP problem can be treated with the same approach.
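The multi-marker formulation can be illustrated with a small sketch. It is not the algorithm from the paper: it swaps in a univariate EDA (UMDA-style, independent per-SNP probabilities) in place of the structure-based model the abstract describes, and it keeps candidates feasible with a simple repair step. The names `single_cover`, `pair_cover`, `repair`, and `umda_tag_selection`, as well as the toy coverage tables, are hypothetical and chosen only for illustration; in practice the tables would be derived from SNP correlation data.

```python
import random

def is_covered(snp, tags, single_cover, pair_cover):
    """True if `snp` is a tag, or is covered by one tag or by a pair of tags."""
    if snp in tags:
        return True
    if any(t in tags for t in single_cover.get(snp, ())):
        return True
    return any(a in tags and b in tags for a, b in pair_cover.get(snp, ()))

def is_feasible(tags, n_snps, single_cover, pair_cover):
    """A tagging set is feasible when every SNP is covered."""
    return all(is_covered(s, tags, single_cover, pair_cover) for s in range(n_snps))

def repair(tags, n_snps, single_cover, pair_cover):
    """Make a candidate feasible by tagging each still-uncovered SNP directly."""
    tags = set(tags)
    for s in range(n_snps):
        if not is_covered(s, tags, single_cover, pair_cover):
            tags.add(s)
    return tags

def umda_tag_selection(n_snps, single_cover, pair_cover,
                       pop_size=60, n_select=20, generations=40, seed=0):
    """Univariate EDA sketch (assumption: independent per-SNP probabilities)."""
    rng = random.Random(seed)
    p = [0.5] * n_snps               # probability that each SNP is a tag
    best = set(range(n_snps))        # tagging every SNP is always feasible
    for _ in range(generations):
        # Sample candidates from the current model and repair them.
        pop = []
        for _ in range(pop_size):
            cand = {i for i in range(n_snps) if rng.random() < p[i]}
            pop.append(repair(cand, n_snps, single_cover, pair_cover))
        pop.sort(key=len)            # fewer tags is better
        if len(pop[0]) < len(best):
            best = pop[0]
        # Re-estimate the marginals from the selected (smallest) solutions.
        elite = pop[:n_select]
        for i in range(n_snps):
            freq = sum(i in t for t in elite) / n_select
            p[i] = min(0.95, max(0.05, freq))  # clamp to keep some exploration
    return best

# Toy instance (hypothetical): 6 SNPs, indexed 0..5.
N_SNPS = 6
single_cover = {0: [1], 1: [0], 2: [0]}      # SNP -> tags that cover it alone
pair_cover = {4: [(0, 3)], 5: [(0, 3)]}      # SNP -> tag pairs that cover it

multi_marker = umda_tag_selection(N_SNPS, single_cover, pair_cover)
single_marker = umda_tag_selection(N_SNPS, single_cover, {})
print(sorted(multi_marker), sorted(single_marker))
```

In this toy instance, allowing pairs lets SNPs 4 and 5 be covered jointly by tags 0 and 3, so the multi-marker set can be smaller than any single-marker set. A model that exploits the correlation structure between SNPs, as the abstract describes, would replace the independent probabilities `p` used here.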
