Is Bang-Bang Control All You Need? Solving Continuous Control with Bernoulli Policies

Reinforcement learning (RL) for continuous control typically employs distributions whose support covers the entire action space. In this work, we investigate the colloquially known phenomenon that trained agents often prefer actions at the boundaries of that space. We draw theoretical connections to the emergence of bang-bang behavior in optimal control, and provide extensive empirical evaluation across a variety of recent RL algorithms. We replace the normal Gaussian by a Bernoulli distribution that solely considers the extremes along each action dimension, i.e., a bang-bang controller. Surprisingly, this achieves state-of-the-art performance on several continuous control benchmarks, in contrast to robotic hardware, where energy and maintenance costs affect controller choices. Since exploration, learning, and the final solution are entangled in RL, we provide additional imitation learning experiments to reduce the impact of exploration on our analysis. Finally, we show that our observations generalize to environments that aim to model real-world challenges, and we evaluate factors that mitigate the emergence of bang-bang solutions. Our findings emphasise challenges for benchmarking continuous control algorithms, particularly in light of potential real-world applications.
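To make the policy substitution concrete, the sketch below shows one plausible way to implement such a Bernoulli bang-bang policy head in plain Python/NumPy. It is a minimal illustration, not the authors' implementation: the names (BangBangPolicy, a_max) and the linear parameterization are assumptions made for brevity. Each action dimension gets an independent Bernoulli that selects between the two extremes of the action range.

```python
import numpy as np

# Minimal sketch (illustrative, not the paper's code): a bang-bang policy head
# that replaces a Gaussian over continuous actions with an independent Bernoulli
# per action dimension, emitting only the extreme actions {-a_max, +a_max}.
class BangBangPolicy:
    def __init__(self, weights, bias, a_max=1.0):
        self.weights = weights  # (obs_dim, act_dim) linear map, stands in for any network
        self.bias = bias        # (act_dim,)
        self.a_max = a_max      # magnitude of the extreme action per dimension

    def act(self, obs, rng):
        # Per-dimension Bernoulli probability of choosing the "high" extreme.
        logits = obs @ self.weights + self.bias
        p_high = 1.0 / (1.0 + np.exp(-logits))
        high = rng.random(p_high.shape) < p_high
        # Map the binary sample onto the action-space boundary: +a_max or -a_max.
        return np.where(high, self.a_max, -self.a_max)

# Usage: sample an extreme action for a 3-dimensional action space.
rng = np.random.default_rng(0)
policy = BangBangPolicy(weights=rng.normal(size=(5, 3)), bias=np.zeros(3))
print(policy.act(obs=rng.normal(size=5), rng=rng))
```

Because sampling is restricted to the two extremes, the policy's support lies entirely on the boundary of the action space, which is exactly the bang-bang structure the paper analyzes; everything else in the agent (critic, optimizer, exploration schedule) can remain unchanged.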
