Towards Learning Universal Hyperparameter Optimizers with Transformers

Meta-learning hyperparameter optimization (HPO) algorithms from prior experiments is a promising approach to improving optimization efficiency over objective functions from a similar distribution. However, existing methods are restricted to learning from experiments sharing the same set of hyperparameters. In this paper, we introduce the OptFormer, the first text-based Transformer HPO framework that provides a universal end-to-end interface for jointly learning policy and function prediction when trained on vast tuning data from the wild, such as Google's Vizier database, one of the world's largest HPO datasets. Our extensive experiments demonstrate that the OptFormer can simultaneously imitate at least 7 different HPO algorithms, and that its performance can be further improved via its function uncertainty estimates. Compared to a Gaussian Process, the OptFormer also learns a robust prior distribution over hyperparameter response functions, and can thereby provide more accurate and better calibrated predictions. This work paves the way for future extensions that train a Transformer-based model as a general HPO optimizer.
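The key idea behind a text-based interface is that studies with arbitrary, heterogeneous hyperparameter sets can be flattened into token sequences, so a single Transformer can consume tuning histories regardless of their search spaces. The sketch below illustrates one plausible serialization; the format, field names, and separators are illustrative assumptions for exposition, not the paper's exact scheme:

```python
# Hypothetical sketch of serializing an HPO study as text for a
# sequence model. The concrete format here (key=value pairs, "|" and
# ";" separators) is an illustrative assumption, not the OptFormer's
# actual tokenization.

def serialize_trial(params, objective):
    """Render one tuning trial (hyperparameters + result) as flat text."""
    kv = ", ".join(f"{k}={v}" for k, v in sorted(params.items()))
    return f"{kv} | objective={objective:.4f}"

def serialize_study(metadata, trials):
    """Concatenate study metadata and the trial history into one prompt.

    Because everything is text, studies with different hyperparameter
    sets map into the same token space, which is what allows joint
    training across heterogeneous tuning data.
    """
    header = ", ".join(f"{k}: {v}" for k, v in metadata.items())
    body = " ; ".join(serialize_trial(p, y) for p, y in trials)
    return f"[{header}] {body}"

prompt = serialize_study(
    {"task": "image-classification", "metric": "accuracy"},
    [
        ({"lr": 1e-3, "batch": 32}, 0.91),
        ({"lr": 1e-2, "batch": 64}, 0.87),
    ],
)
```

A model trained on such sequences can then be decoded autoregressively: generating the next `key=value` segment acts as the policy (suggesting a new trial), while generating the `objective=` continuation acts as function prediction.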
