Learning to Optimise Routing Problems using Policy Optimisation

Deep reinforcement learning (DRL) has demonstrated promising performance in learning effective heuristics for complex combinatorial optimisation problems via policy networks. However, traditional reinforcement learning (RL) suffers from insufficient exploration, which often leads to premature convergence to poor policies and limits the performance of DRL. To address this, we propose an Entropy Regularised Reinforcement Learning (ERRL) method that supports exploration by producing more stochastic policies, thereby improving optimisation. The ERRL method incorporates an entropy term, defined over the policy network's outputs, into the loss function of the policy network. Policy exploration is thus explicitly encouraged while remaining balanced against maximising the reward, which reduces the risk of premature convergence to inferior policies. We implement the ERRL method on top of two existing DRL algorithms and compare our implementations against those two algorithms, as well as several state-of-the-art heuristic-based non-RL approaches, on three categories of routing problems: the travelling salesman problem (TSP), the capacitated vehicle routing problem (CVRP), and multiple routing with fixed fleet problems (MRPFF). Experimental results show that in most test cases the proposed method finds better solutions faster than the state-of-the-art algorithms.
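For illustration, one standard way to write such an entropy-regularised policy-gradient loss is sketched below; the baseline $b$, the entropy weight $\beta$, and the exact expectation structure are illustrative assumptions rather than the precise formulation used by ERRL, which is detailed in the method section.

\begin{equation*}
  \mathcal{L}(\theta)
  = -\,\mathbb{E}_{\tau \sim \pi_\theta}\!\big[\,\big(R(\tau) - b\big)\,\log \pi_\theta(\tau)\,\big]
  \;-\; \beta\, \mathbb{E}_{s}\!\big[\,\mathcal{H}\big(\pi_\theta(\cdot \mid s)\big)\big],
  \qquad
  \mathcal{H}\big(\pi_\theta(\cdot \mid s)\big)
  = -\sum_{a} \pi_\theta(a \mid s)\,\log \pi_\theta(a \mid s).
\end{equation*}

The first term is the usual baseline-corrected policy-gradient loss; the second term rewards policies with higher entropy over the routing actions, so that exploration is traded off against reward maximisation through the weight $\beta$.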