Learning Road Traffic Control: Towards Practical Traffic Control Using Policy Gradients

The optimal control of traffic lights in urban road networks is a highly complex problem. Many factors influence the flow of traffic, and hence the performance of a traffic network, yet few of them can readily be measured. Control systems in use today are often relatively simple and date back several decades, while more sophisticated optimisation methods fail for large networks. Reinforcement learning algorithms are a means of learning control strategies for complex environments, requiring no pre-specified knowledge about possible solutions. Policy-gradient algorithms are reinforcement learning methods that are particularly well suited to learning control strategies (policies) for large and only partially observable environments. They represent the policy with a parameterised function and perform gradient ascent on the parameters of this function (a schematic form of this update is sketched below); under appropriate conditions, convergence to a (local) optimum is guaranteed. In this work, we examine how policy-gradient ascent can be used to learn the control of traffic signals, with the goal of optimising the traffic flow in a road network. We show that our methods perform very well, scale up to large networks, and achieve better results than other commonly used approaches such as saturation-balancing algorithms.

Acknowledgements

I thank Douglas Aberdeen, my main advisor, for his guidance and help, and Bernhard Nebel for making this thesis possible. Thanks to Olivier Buffet for valuable advice throughout this thesis, and for his patient help during long hours of debugging. Conrad Sanderson, Simon Günter and Malte Helmert have all proof-read parts of this thesis and deserve thanks for their dedication and helpful suggestions. Finally, thanks to all at the Statistical Machine Learning group of NICTA, Canberra, for making me feel at home in the group, and to the New South Wales Roads and Traffic Authority for providing the basis for this thesis through a cooperation project with NICTA.
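To make the gradient-ascent step concrete, a generic policy-gradient update can be written as follows. This is only a minimal sketch of a REINFORCE-style Monte-Carlo estimator under standard assumptions (episodic interaction, differentiable parameterised policy); the symbols \theta, \alpha, \pi_\theta, o_t, a_t, R_t and \eta are generic placeholders rather than the exact notation or estimator used in this thesis:

\theta_{k+1} = \theta_k + \alpha \, \widehat{\nabla_\theta \eta}(\theta_k),
\qquad
\widehat{\nabla_\theta \eta}(\theta) = \frac{1}{T} \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid o_t) \, R_t,

where \pi_\theta(a \mid o) is the parameterised policy mapping observations to a distribution over actions (for example, signal phases), \eta(\theta) is its long-term reward, R_t is the observed return following step t, and \alpha > 0 is a step size. Repeatedly applying this update performs stochastic gradient ascent on \eta(\theta).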
