A Lazy Approach to Online Learning with Constraints

Abstract

In this paper, we study a sequential decision-making problem in which the objective is to maximize the total reward while satisfying constraints that are defined at every time step. The novelty of the setup is our assumption that the rewards and constraints are controlled by a potentially adversarial opponent. To solve the problem, we propose a novel expert algorithm that guarantees vanishing regret while violating the constraints only a bounded number of times. The quality of our expert solutions is evaluated on a challenging power management problem. The results of our experiments show that online learning with constraints can be carried out successfully in practice.

Introduction

Online learning with expert advice (Cesa-Bianchi & Lugosi 2006) has been studied extensively by the machine learning community. The framework has also been used successfully to solve many real-world problems, such as adaptive caching (Gramacy et al. 2003) and power management (Helmbold et al. 2000; Dhiman & Simunic 2006; Kveton et al. 2007). The major advantage of the online setting is that no assumption is made about the environment. As a result, there is no need to build a model of the environment or estimate its parameters. In turn, this type of learning is naturally robust to environmental changes and suitable for solving dynamic real-world problems.

In this paper, we study online learning problems with side constraints. A similar setup was considered by Mannor and Tsitsiklis (2006). Side constraints are common in real-world domains. For instance, power management problems are often formulated as maximizing power savings subject to average performance criteria. These criteria usually restrict the rate of bad power management actions and can be naturally represented by constraints.

Our work makes two contributions. First, we show how to apply prediction with expert advice to solve online optimization problems with constraints. Our solution is both practical and sound. To the best of our knowledge, it is the first solution with both properties. Second, we use the proposed approach to solve a real-world power management (PM) problem.

The paper is structured as follows. First, we formulate our optimization problem and relate it to existing work. Second, we propose and analyze a practical solution to the problem based on prediction with expert advice. Third, we evaluate the quality of our solution on a real-world PM problem. Finally, we summarize our work and suggest future research directions.

Online constrained optimization

In this paper, we study an online learning problem in which an agent wants to maximize its total reward subject to average-cost constraints. At every time t, the agent takes an action θ_t from the action set A, and then receives a reward r_t(θ_t) ∈ [0, 1] and a cost c_t(θ_t) ∈ [0, 1]. We assume that the agent has no prior knowledge of the reward and cost functions except that they are bounded. Therefore, these functions can be generated in a nonstationary or even adversarial way. The agent may consider only the past rewards r_1, ..., r_{t-1} and the past costs c_1, ..., c_{t-1} when deciding which action θ_t to take.

To clarify our online learning problem and its challenges, we first define an offline version of the problem. This offline version simply assumes that our agent knows all reward and cost terms in advance. In such a setting, the optimal strategy of the agent can be expressed as a solution to the constrained optimization problem:
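As an illustrative sketch only, suppose the agent commits to a single fixed action θ over a horizon of n steps and must keep its average cost below a budget c_0; the horizon n and the threshold c_0 are assumptions introduced here for concreteness and are not fixed by the text above. Under these assumptions, the offline problem could be written as

% Illustrative sketch: n (horizon) and c_0 (average-cost budget) are
% assumed for concreteness; the exact formulation may differ, e.g.,
% it may optimize over a sequence of actions rather than a fixed θ.
\begin{align*}
  \max_{\theta \in A} \quad & \sum_{t=1}^{n} r_t(\theta) \\
  \text{subject to} \quad & \frac{1}{n} \sum_{t=1}^{n} c_t(\theta) \le c_0 .
\end{align*}

The sketch makes the tension explicit: the objective rewards actions that accumulate reward, while the constraint bounds the average cost incurred over the same horizon.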