When Huge Is Routine: Scaling Genetic Algorithms and Estimation of Distribution Algorithms via Data-Intensive Computing

Data-intensive computing has emerged as a key approach for processing large volumes of data by exploiting massive parallelism. Data-intensive computing frameworks have shown that terabytes and petabytes of data can be processed routinely. However, there has been little effort to explore how data-intensive computing can help scale evolutionary computation. In this book chapter we explore how evolutionary computation algorithms can be modeled using two different data-intensive frameworks: Yahoo!’s Hadoop and NCSA’s Meandre. We present a detailed step-by-step description of how three evolutionary computation algorithms with different execution profiles can be translated into the data-intensive computing paradigms. Results show that (1) Hadoop is an excellent choice for pushing evolutionary computation boundaries on very large problems, and (2) Meandre can deliver linear speedups transparently, without changing the underlying data-intensive flow, thanks to its inherent parallel processing.
