Genetic algorithms and Gaussian Bayesian networks to uncover the predictive core set of bibliometric indices

The diversity of bibliometric indices available today raises the challenge of exploiting the relationships among them. Our research uncovers the best core set of relevant indices for predicting other bibliometric indices. An added difficulty is deciding the role of each variable, that is, which bibliometric indices are predictor variables and which are response variables. This results in a novel multioutput regression problem in which the role of each variable (predictor or response) is unknown beforehand. We use Gaussian Bayesian networks to solve this problem and to discover multivariate relationships among bibliometric indices. These networks are learned by a genetic algorithm that searches for the models that best predict bibliometric data. Results show that the optimal induced Gaussian Bayesian networks corroborate previously reported relationships between several indices and also suggest new, previously unreported interactions. An extended analysis of the best model illustrates that a set of 12 bibliometric indices can be accurately predicted using only a smaller predictive core subset composed of citations, g-index, q2-index, and hr-index. This research uses bibliometric data on Spanish full professors in the computer science area.
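
To make the general idea concrete, the following is a minimal sketch of a genetic algorithm searching over Gaussian Bayesian network structures. It is not the authors' implementation. It assumes a fixed variable ordering (edges only go from a lower to a higher column index, which keeps every candidate acyclic), uses a BIC score as the fitness, and fits each node by ordinary least squares on its parents. All function names and parameter values (`node_bic`, `decode`, `fitness`, `genetic_search`, population size, mutation rate) are hypothetical choices made for illustration.

```python
import numpy as np

def node_bic(X, child, parents):
    """BIC term of one node modelled as a linear Gaussian regression on its parents."""
    n = X.shape[0]
    y = X[:, child]
    A = np.column_stack([np.ones(n)] + [X[:, p] for p in parents])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    sigma2 = max(resid @ resid / n, 1e-12)          # ML estimate of the residual variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = A.shape[1] + 1                              # regression coefficients plus variance
    return loglik - 0.5 * k * np.log(n)

def decode(bits, d):
    """Map a bit string to parent lists; bit (i, j) with i < j means an edge i -> j."""
    parents = {j: [] for j in range(d)}
    idx = 0
    for i in range(d):
        for j in range(i + 1, d):
            if bits[idx]:
                parents[j].append(i)
            idx += 1
    return parents

def fitness(bits, X):
    """Total BIC of the network encoded by `bits` (higher is better)."""
    d = X.shape[1]
    parents = decode(bits, d)
    return sum(node_bic(X, j, parents[j]) for j in range(d))

def genetic_search(X, pop_size=30, generations=40, p_mut=0.05, seed=0):
    """Simple GA: binary tournament selection, uniform crossover, bit-flip mutation, elitism."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    n_bits = d * (d - 1) // 2
    pop = rng.integers(0, 2, size=(pop_size, n_bits))
    scores = np.array([fitness(ind, X) for ind in pop])

    def tournament():
        a, b = rng.integers(0, pop_size, 2)
        return pop[a] if scores[a] >= scores[b] else pop[b]

    for _ in range(generations):
        children = [pop[scores.argmax()].copy()]    # keep the best individual unchanged
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            mask = rng.integers(0, 2, n_bits).astype(bool)
            child = np.where(mask, p1, p2)          # uniform crossover
            flip = rng.random(n_bits) < p_mut
            child = np.where(flip, 1 - child, child)
            children.append(child)
        pop = np.array(children)
        scores = np.array([fitness(ind, X) for ind in pop])
    best = pop[scores.argmax()]
    return decode(best, d), float(scores.max())

if __name__ == "__main__":
    # Synthetic data (not bibliometric data): column 1 depends linearly on column 0.
    rng = np.random.default_rng(1)
    x0 = rng.normal(size=300)
    X = np.column_stack([x0, 2 * x0 + 0.1 * rng.normal(size=300), rng.normal(size=300)])
    print(genetic_search(X))
```

In a learned structure of this kind, nodes without parents act as predictors and the remaining nodes as responses, which is one way to read the "unknown role" aspect of the problem. The fixed ordering is a simplification: a search that also has to decide the roles of the variables would need to explore orderings or general directed acyclic graphs.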
