Information-Geometric Optimization Algorithms: A Unifying Picture via Invariance Principles

We present a canonical way to turn any smooth parametric family of probability distributions on an arbitrary search space X into a continuous-time black-box optimization method on X, the information-geometric optimization (IGO) method. Invariance as a major design principle keeps the number of arbitrary choices to a minimum. The resulting IGO flow is the flow of an ordinary differential equation conducting a natural gradient ascent of an adaptive, time-dependent transformation of the objective function. It makes no particular assumptions about the objective function to be optimized. The IGO method produces explicit IGO algorithms through time discretization. It naturally recovers versions of known algorithms and offers a systematic way to derive new ones. In continuous search spaces, IGO algorithms take a form related to natural evolution strategies (NES). The cross-entropy method is recovered in a particular case with a large time step, and can be extended into a smoothed, parametrization-independent maximum likelihood update (IGO-ML). When applied to the family of Gaussian distributions on R^d, the IGO framework recovers versions of the well-known CMA-ES algorithm and of xNES. For the family of Bernoulli distributions on {0,1}^d, we recover the seminal PBIL algorithm and cGA. For the distributions of restricted Boltzmann machines, we naturally obtain a novel algorithm for discrete optimization on {0,1}^d. All these algorithms are natural instances of, and unified under, the single information-geometric optimization framework. Thanks to its intrinsic formulation, the IGO method achieves maximal invariance properties: invariance under reparametrization of the search space X, under a change of parameters of the probability distribution, and under increasing transformations of the function to be optimized. The latter is achieved through an adaptive, quantile-based formulation of the objective. Theoretical considerations strongly suggest that IGO algorithms are essentially characterized by a minimal change of the distribution over time. Therefore they have minimal loss of diversity over the course of optimization, provided the initial diversity is high. First experiments using restricted Boltzmann machines confirm this insight. As a simple consequence, IGO seems to provide, from information theory, an elegant way to simultaneously explore several valleys of a fitness landscape in a single run.
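As a concrete illustration of the discretized update the abstract describes, here is a minimal Python sketch of one IGO step for the Bernoulli family on {0,1}^d, which yields a PBIL/cGA-like rule. It is not code from the paper: the function name igo_bernoulli_step, the sample size, the particular weight function w(q) = 2·1{q <= 1/2}, and the clipping bounds are illustrative assumptions. It relies on the fact that, for Bernoulli distributions in the mean parametrization, the natural gradient of ln P_theta(x) reduces to x - theta, and it replaces raw objective values with quantile-based (rank-based) selection weights, which is what makes the update invariant under increasing transformations of f.

```python
import numpy as np

def igo_bernoulli_step(theta, f, n_samples=100, dt=0.1, rng=None):
    """One discretized IGO step for Bernoulli(theta) on {0,1}^d (a PBIL-like
    update); a sketch under the assumptions stated above, minimizing f."""
    rng = np.random.default_rng() if rng is None else rng
    d = len(theta)
    # Sample candidate points from the current search distribution.
    xs = (rng.random((n_samples, d)) < theta).astype(float)
    values = np.array([f(x) for x in xs])
    # Quantile-based selection weights: smaller f (better) gets larger
    # weight; here w(q) = 2 for the best half of the sample, 0 otherwise,
    # so the weights sum to 1. This is one possible choice of w, not the
    # only one the IGO framework allows.
    ranks = np.argsort(np.argsort(values))          # 0 = best sample
    q = (ranks + 0.5) / n_samples
    w = np.where(q <= 0.5, 2.0, 0.0) / n_samples
    # Natural-gradient ascent step: for Bernoulli means, the natural
    # gradient of ln P_theta(x) is simply x - theta.
    theta = theta + dt * (w @ (xs - theta))
    # Keep theta away from the boundary so all points retain nonzero
    # probability (the bounds 1/d and 1 - 1/d are an illustrative choice).
    return np.clip(theta, 1.0 / d, 1.0 - 1.0 / d)
```

For example, iterating theta = igo_bernoulli_step(theta, lambda x: -x.sum()) from theta = np.full(20, 0.5) drives the distribution toward the all-ones string; any strictly increasing transformation of that objective would, up to sampling noise, produce the same trajectory.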
