A New Theoretical Framework for Fast and Accurate Online Decision-Making

We introduce a novel theoretical framework for Return On Investment (ROI) maximization in repeated decision-making. Our setting is motivated by the use case of companies that regularly receive proposals for technological innovations and want to quickly decide whether they are worth implementing. We design an algorithm for learning ROI-maximizing decision-making policies over a sequence of innovation proposals. Our algorithm provably converges to an optimal policy in the class Π at a rate of order min{1/(NΔ²), N^(−1/3)}, where N is the number of innovations and Δ is the suboptimality gap in Π. A significant hurdle of our formulation, which sets it apart from other online learning problems such as bandits, is that running a policy does not provide an unbiased estimate of its performance.
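
Stated as a displayed bound, the convergence guarantee reads as follows. This is only a restatement of the rate quoted above; the expected-ROI notation, the comparator π⋆, and the constant c are illustrative assumptions rather than the paper's exact statement:

$$
\mathrm{ROI}(\pi^\star) \;-\; \mathbb{E}\big[\mathrm{ROI}(\pi_N)\big] \;\le\; c \cdot \min\left\{ \frac{1}{N\Delta^{2}},\; \frac{1}{N^{1/3}} \right\},
\qquad \pi^\star \in \operatorname*{arg\,max}_{\pi \in \Pi} \mathrm{ROI}(\pi),
$$

where π_N denotes the policy produced by the algorithm after N innovations. The first term dominates on easy instances where the gap Δ is large, while the gap-independent N^(−1/3) term takes over as Δ → 0.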
