Inference of Population Structure Using Genetic Markers and a Bayesian Model Averaging Approach for Clustering

Analysing the structure of populations from genetic data is essential in population genetics. It is used, for instance, to study the evolution of species or to correct for population stratification in association studies. These genetic data, normally based on DNA polymorphisms, may contain irrelevant information that biases the inference of population structure. In this paper we adapt a recently proposed algorithm, named multi-start EMA, to the inference of population structure. This algorithm is able to deal with irrelevant information when obtaining the (probabilistic) population partition. Additionally, we present a marker selection test that identifies the markers most relevant to retrieving that population partition. The proposed algorithm is compared with the widely used STRUCTURE software on the basis of the F_ST metric and the log-likelihood score. The results show that the proposed algorithm improves the recovery of the population structure. Moreover, information about relevant markers obtained by the multi-start EMA can be used to improve the results obtained by other methods, to correct for population stratification, or to reduce the economic cost of sequencing new samples. The software presented in this paper is available online at http://www.sc.ehu.es/ccwbayes/members/guzman.
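
To make the F_ST comparison criterion concrete, the following is a minimal sketch of Wright's classical definition, F_ST = (H_T - H_S) / H_T, for a single biallelic marker. It is an illustration only: the function name `fst_biallelic` and its inputs are hypothetical, and the estimator actually used in the paper may differ (for example, a sample-size-corrected estimator).

```python
def fst_biallelic(freqs, sizes):
    """Wright's F_ST for one biallelic marker (illustrative sketch).

    freqs: frequency of one allele in each inferred subpopulation
    sizes: number of individuals in each subpopulation (used as weights)
    """
    total = sum(sizes)
    weights = [n / total for n in sizes]

    # Pooled allele frequency across all subpopulations.
    p_bar = sum(w * p for w, p in zip(weights, freqs))

    # H_T: expected heterozygosity of the pooled population.
    h_t = 2 * p_bar * (1 - p_bar)

    # H_S: weighted mean expected heterozygosity within subpopulations.
    h_s = sum(w * 2 * p * (1 - p) for w, p in zip(weights, freqs))

    # A marker that is monomorphic overall carries no structure signal.
    if h_t == 0:
        return 0.0
    return (h_t - h_s) / h_t


if __name__ == "__main__":
    # Two equally sized clusters with clearly different allele frequencies.
    print(fst_biallelic([0.2, 0.8], [50, 50]))
```

Under this definition, a partition whose clusters differ more strongly in allele frequencies yields a larger F_ST, which is why the metric can serve to compare the partitions returned by different clustering methods.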
