Supervised pre-processing approaches in multiple class variables classification for fish recruitment forecasting

A multi-species approach to fisheries management requires taking into account the interactions between species in order to improve recruitment forecasting for each of them. Recent advances in Bayesian networks make it possible to learn models in which several interrelated variables are forecast simultaneously. These models are known as multi-dimensional Bayesian network classifiers (MDBNs). Pre-processing steps are critical for the subsequent learning of the model in these kinds of domains. Therefore, in the present study, a set of 'state-of-the-art' uni-dimensional pre-processing methods, within the categories of missing data imputation, feature discretization and feature subset selection, are adapted for use with MDBNs. A framework that includes the proposed multi-dimensional supervised pre-processing methods, coupled with an MDBN classifier, is tested with synthetic datasets and in the real domain of fish recruitment forecasting. The rate at which three fish species (anchovy, sardine and hake) are all forecast correctly at the same time improves from 17.3% to 29.5% using the multi-dimensional approach in comparison to mono-species models. The probability assessments also improve markedly, with the average error (estimated by means of the Brier score) falling from 0.35 to 0.27. Finally, these improvements exceed those obtained when forecasting the species in pairs.

Highlights
- We propose supervised filter pre-processing methods for multi-dimensional classification.
- The pre-processing methods, and the circumstances in which they perform best, are identified.
- We show the application to forecasting the recruitment of multiple fish species.
- The multi-dimensional approach improves the forecasting of each species' recruitment.
- It improves simultaneous forecasting of all species and probability estimates.
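The two evaluation measures quoted in the abstract can be illustrated with a minimal sketch. It assumes that the simultaneous accuracy counts a case as correct only when every species is forecast correctly, and that the Brier score follows the original multi-class definition (mean over cases of the summed squared difference between the predicted probabilities and the one-hot outcome). The function names, the recruitment categories and the toy data are hypothetical and are not taken from the study.

```python
from typing import Sequence


def joint_accuracy(true_labels: Sequence[Sequence[str]],
                   predicted_labels: Sequence[Sequence[str]]) -> float:
    """Fraction of cases in which ALL class variables are predicted correctly."""
    hits = sum(1 for t, p in zip(true_labels, predicted_labels)
               if tuple(t) == tuple(p))
    return hits / len(true_labels)


def per_variable_accuracy(true_labels: Sequence[Sequence[str]],
                          predicted_labels: Sequence[Sequence[str]],
                          index: int) -> float:
    """Accuracy for a single class variable (e.g. one species)."""
    hits = sum(1 for t, p in zip(true_labels, predicted_labels)
               if t[index] == p[index])
    return hits / len(true_labels)


def brier_score(probabilities: Sequence[Sequence[float]],
                true_index: Sequence[int]) -> float:
    """Multi-class Brier score (assumed definition): 0 is perfect, 2 is worst."""
    total = 0.0
    for probs, k in zip(probabilities, true_index):
        total += sum((p - (1.0 if j == k else 0.0)) ** 2
                     for j, p in enumerate(probs))
    return total / len(probabilities)


# Hypothetical toy data: one tuple per year, ordered (anchovy, sardine, hake).
truth = [("low", "high", "medium"), ("high", "high", "low"),
         ("low", "low", "low"), ("medium", "high", "low")]
preds = [("low", "high", "medium"), ("high", "low", "low"),
         ("low", "low", "low"), ("medium", "high", "high")]

print(joint_accuracy(truth, preds))            # all three species right at once
print(per_variable_accuracy(truth, preds, 0))  # anchovy only
print(brier_score([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]], [0, 2]))
```

In this toy example each species is individually forecast well, yet only half of the years have all three species correct, which is why the simultaneous accuracy is a much stricter measure than the per-species accuracies.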
