Exploration and exploitation balance management in fuzzy reinforcement learning

This paper presents a fuzzy scheme for managing the balance between exploration and exploitation that can be incorporated into any critic-only fuzzy reinforcement learning method. Because of its advantages, the paper focuses on a recently developed continuous reinforcement learning method called fuzzy Sarsa learning (FSL). Establishing this balance depends heavily on the accuracy of the action value function approximation. First, the overfitting problem that arises when approximating the action value function in continuous reinforcement learning algorithms is discussed, and a new adaptive learning rate is proposed to prevent it. By relating the learning rate to the inverse of the "fuzzy visit value" of the current state, the training data are forced to have a uniform effect on the weight parameters of the approximator, and overfitting is thereby resolved. Then, a fuzzy balancer is introduced to balance exploration against exploitation by generating a suitable temperature factor for the Softmax formula. Finally, an enhanced FSL (EFSL) is obtained by integrating the proposed adaptive learning rate and the fuzzy balancer into FSL. Simulation results show that EFSL eliminates overfitting, manages the balance well, and outperforms FSL in terms of learning speed and action quality.
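The two mechanisms described above can be illustrated with a minimal sketch. The function names, the exact form of the visit-value-based learning rate, and the `base_rate` parameter are assumptions for illustration only; the abstract does not specify the precise formulas, and the actual fuzzy balancer generates the temperature through a fuzzy inference system rather than taking it as a fixed input.

```python
import math

def softmax_probs(q_values, temperature):
    # Boltzmann/Softmax action-selection distribution: a higher
    # temperature flattens the distribution (more exploration), a
    # lower temperature concentrates it on the greedy action
    # (more exploitation). In EFSL the temperature would be supplied
    # by the fuzzy balancer rather than chosen by hand.
    exps = [math.exp(q / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

def adaptive_learning_rate(visit_value, base_rate=1.0):
    # Hypothetical learning rate inversely related to the (fuzzy)
    # visit value of the current state: frequently visited states
    # receive smaller updates, so the training data have a more
    # uniform effect on the approximator's weights.
    return base_rate / (1.0 + visit_value)
```

For example, `softmax_probs([1.0, 2.0, 3.0], 0.5)` strongly favors the highest-valued action, while the same Q-values with temperature `10.0` yield a nearly uniform distribution.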