Predicting citation count of Bioinformatics papers within four years of publication

MOTIVATION: As scientific output grows, publishers of scientific journals face the difficult task of selecting, from a large pool of articles, the high-quality ones that will attract as many readers as possible. A tool capable of predicting an article's citation count within the first few years after publication would pave the way for new assessment systems.

RESULTS: This article presents a new approach based on building several prediction models for the journal Bioinformatics. These global models predict the citation count of an article within 4 years after publication. The predictive features are the tokens found in the abstracts of Bioinformatics papers, along with other features such as the journal sections and 2-week post-publication periods. To improve on the accuracy of the global models, specific models were built for each Bioinformatics journal section (Data and Text Mining, Databases and Ontologies, Gene Expression, Genetics and Population Analysis, Genome Analysis, Phylogenetics, Sequence Analysis, Structural Bioinformatics and Systems Biology). With these section-specific models, the average prediction success rate across the nine sections for the 4-year time horizon was 89.4% for the naive Bayes and 91.5% for the logistic regression supervised classification methods.

AVAILABILITY: Supplementary material on this experimental survey is available at http://www.dia.fi.upm.es/~concha/bioinformatics.html

CONTACT: aibanez@fi.upm.es
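The idea described under RESULTS, classifying a paper's citation level from the tokens in its abstract, can be illustrated with a minimal sketch. This is not the authors' implementation: it is a plain multinomial naive Bayes with Laplace smoothing, and the toy abstracts, the two class labels ("low" and "high"), and all function and class names are hypothetical, chosen only for illustration. The paper's other features (journal section, post-publication period) and its logistic regression models are omitted.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class TokenNaiveBayes:
    """Multinomial naive Bayes over abstract tokens (Laplace smoothing)."""

    def fit(self, abstracts, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.token_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()
        for text, label in zip(abstracts, labels):
            tokens = tokenize(text)
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def log_scores(self, text):
        """Return the unnormalized log posterior of each class."""
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # class prior
            for tok in tokenize(text):
                if tok in self.vocab:  # tokens unseen in training are ignored
                    score += math.log((counts[tok] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Hypothetical toy data: made-up abstracts with made-up citation classes.
train_abstracts = [
    "alignment algorithm for genome sequence assembly",
    "fast sequence alignment tool for genome data",
    "database of curated ontology terms",
    "ontology database update and curation notes",
]
train_labels = ["high", "high", "low", "low"]

model = TokenNaiveBayes().fit(train_abstracts, train_labels)
print(model.predict("genome alignment tool"))
print(model.predict("ontology database"))
```

In a realistic setting the class boundaries would come from discretizing observed citation counts, and accuracy would be estimated with cross-validation rather than judged on a handful of examples.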
