Learning Exploration Policies with Models

Reinforcement learning can greatly profit from world models that are updated by experience and used for computing policies. Fast discovery of near-optimal policies, however, requires focusing on "useful" experiences. Using an additional exploration model, we learn an exploration policy that maximizes "exploration rewards" for visiting states that promise information gain. We augment this approach with an extension of Kaelbling's Interval Estimation algorithm to the model-based case. Experimental results in stochastic environments demonstrate the advantages of this hybrid approach.
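
To make the idea concrete, the following is a minimal tabular sketch of model-based exploration with an optimistic bonus. The class name, the inverse-visit-count bonus, and the confidence-style term are illustrative assumptions, not the paper's exact exploration reward or its Interval Estimation extension.

```python
import numpy as np

class ModelBasedExplorer:
    """Sketch: estimate a world model from experience and value-iterate
    exploration Q-values with an optimistic bonus (assumed form)."""

    def __init__(self, n_states, n_actions, gamma=0.95, beta=1.0):
        self.nS, self.nA = n_states, n_actions
        self.gamma = gamma  # discount factor
        self.beta = beta    # weight of the exploration bonus (assumption)
        self.counts = np.zeros((n_states, n_actions, n_states))  # transition counts
        self.reward_sum = np.zeros((n_states, n_actions))        # summed rewards

    def update(self, s, a, r, s_next):
        # Update the world model from one experience tuple (s, a, r, s').
        self.counts[s, a, s_next] += 1
        self.reward_sum[s, a] += r

    def exploration_q(self, n_iters=100):
        # Estimated transition probabilities and mean rewards.
        n_sa = self.counts.sum(axis=2)
        p_hat = self.counts / np.maximum(n_sa[..., None], 1)
        r_hat = self.reward_sum / np.maximum(n_sa, 1)
        # Optimistic exploration bonus: large for rarely tried (s, a) pairs.
        # This stands in for an information-gain reward / confidence interval.
        bonus = self.beta / np.sqrt(np.maximum(n_sa, 1))
        q = np.zeros((self.nS, self.nA))
        for _ in range(n_iters):
            v = q.max(axis=1)
            q = r_hat + bonus + self.gamma * (p_hat @ v)
        return q

    def act(self, s):
        # Greedy action with respect to the optimistic exploration Q-values.
        return int(np.argmax(self.exploration_q()[s]))
```

In this sketch the exploration policy is recomputed from the current model, so rarely visited state-action pairs keep an inflated value until experience shrinks their bonus, which is the basic mechanism the abstract describes.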