Parameter-exploring policy gradients and their implications

Reinforcement Learning is the most commonly used class of learning algorithms that enables robots and other systems to learn their behaviour autonomously, solely through interaction with the environment. Today's learning systems are often confronted with high-dimensional, continuous problems, and so-called Policy Gradient methods are increasingly used to solve them. The PGPE algorithm developed in this thesis, a new type of Policy Gradient algorithm, allows model-free learning in complex, continuous, partially observable and high-dimensional environments. We show that tasks such as grasping glasses and plates with a human-like arm can be learned with this method without prior knowledge, using purely model-free reinforcement learning in a simulation environment. The balancing of a humanoid robot perturbed by external forces, as well as the dynamic walking behaviour of a mass-spring system, could also be learned. In all experiments, PGPE learned the given tasks more efficiently than well-established methods. Moreover, the use of PGPE is not restricted to robotics: among several investigated methods, it was the most successful in cracking non-differentiable physical cryptography systems, and it is suitable for training multidimensional recurrent neural networks to play Go or for fine-tuning deep neural networks for computer vision. Within the scope of this thesis, the principles used, the advantages and disadvantages, and the differences with respect to well-established methods are derived and analysed in detail.
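
To make the central term concrete, the following is a minimal, illustrative sketch of the parameter-based exploration idea behind PGPE, not the thesis implementation: whole policy parameter vectors are sampled from a Gaussian, each sample is scored by a model-free rollout, and the Gaussian's mean and per-parameter standard deviation are updated along the likelihood-ratio gradient of the expected return. The function `evaluate_return`, the single-sample update and the moving-average baseline are simplifying assumptions for this sketch.

```python
import numpy as np

def pgpe(evaluate_return, dim, iterations=1000, alpha=0.1, seed=0):
    """Minimal PGPE-style sketch: explore in parameter space by sampling
    complete policy parameter vectors from a Gaussian and following the
    likelihood-ratio gradient of the return w.r.t. the Gaussian's mean/std."""
    rng = np.random.default_rng(seed)
    mu = np.zeros(dim)        # mean of the parameter distribution
    sigma = np.ones(dim)      # per-parameter exploration magnitude
    baseline = 0.0            # moving-average baseline to reduce variance
    for _ in range(iterations):
        theta = mu + sigma * rng.standard_normal(dim)  # sample one policy
        r = evaluate_return(theta)                     # one model-free rollout
        adv = r - baseline
        # Gradients of log N(theta | mu, sigma^2) w.r.t. mu and sigma
        grad_mu = (theta - mu) / sigma**2
        grad_sigma = ((theta - mu)**2 - sigma**2) / sigma**3
        mu += alpha * adv * grad_mu
        sigma += alpha * adv * grad_sigma
        sigma = np.maximum(sigma, 1e-3)                # keep exploring
        baseline = 0.9 * baseline + 0.1 * r
    return mu

# Toy usage: maximise a quadratic reward around a hypothetical target vector.
target = np.array([1.0, -2.0, 0.5])
best = pgpe(lambda th: -np.sum((th - target)**2), dim=3)
```

In practice, PGPE additionally uses variance-reduction techniques such as symmetric sampling around the current mean; the single-sample update above is kept only for brevity.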