Learning from eXtreme Bandit Feedback

We study the problem of batch learning from bandit feedback in the setting of extremely large action spaces. Learning from extreme bandit feedback is ubiquitous in recommendation systems, in which billions of decisions are made over sets consisting of millions of choices in a single day, yielding massive observational data. In these large-scale real-world applications, supervised learning frameworks such as eXtreme Multi-label Classification (XMC) are widely used despite the fact that they incur significant biases due to the mismatch between bandit feedback and supervised labels. Such biases can be mitigated by importance sampling techniques, but these techniques suffer from impractical variance when dealing with a large number of actions. In this paper, we introduce a selective importance sampling estimator (sIS) that operates in a significantly more favorable bias-variance regime. The sIS estimator is obtained by performing importance sampling on the conditional expectation of the reward with respect to a small subset of actions for each instance (a form of Rao-Blackwellization). We employ this estimator in a novel algorithmic procedure, named Policy Optimization for eXtreme Models (POXM), for learning from bandit feedback on XMC tasks. In POXM, the selected actions for the sIS estimator are the top-p actions of the logging policy, where p is adjusted from the data and is significantly smaller than the size of the action space. We use a supervised-to-bandit conversion on three XMC datasets to benchmark our POXM method against three competing methods: BanditNet, a previously applied partial matching pruning strategy, and a supervised learning baseline. Whereas BanditNet sometimes improves marginally over the logging policy, our experiments show that POXM systematically and significantly improves over all baselines.
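To make the contrast between plain importance sampling and a selective variant concrete, here is a minimal NumPy sketch. It is one plausible reading of the description above, not the paper's definition: the function names, the array layout, and in particular the choice to renormalize both policies over the top-p actions of the logging policy are assumptions made for illustration. The exact sIS estimator is specified in the body of the paper and may differ.

```python
import numpy as np


def ips_estimate(rewards, target_probs, logging_probs):
    """Vanilla importance sampling estimate of the target policy's reward.

    target_probs / logging_probs are the probabilities each policy assigns
    to the action that was actually logged, one entry per instance.
    """
    weights = target_probs / logging_probs
    return float(np.mean(weights * rewards))


def selective_is_estimate(actions, rewards, target_dist, logging_dist, p):
    """Illustrative selective variant (assumption, not the paper's formula).

    Only instances whose logged action lies in the top-p actions of the
    logging policy contribute, and both policies are conditioned on
    (renormalized over) that subset before forming the weight.
    """
    n = len(actions)
    rows = np.arange(n)
    # Indices of the p most likely actions under the logging policy.
    top_p = np.argsort(-logging_dist, axis=1)[:, :p]
    in_top = (top_p == actions[:, None]).any(axis=1)
    # Probability mass each policy places on the selected subset.
    target_mass = np.take_along_axis(target_dist, top_p, axis=1).sum(axis=1)
    logging_mass = np.take_along_axis(logging_dist, top_p, axis=1).sum(axis=1)
    # Conditional probabilities of the logged action given the subset.
    target_cond = target_dist[rows, actions] / target_mass
    logging_cond = logging_dist[rows, actions] / logging_mass
    weights = np.where(in_top, target_cond / logging_cond, 0.0)
    return float(np.mean(weights * rewards))


def softmax(logits):
    """Row-wise softmax, shifted for numerical stability."""
    z = np.exp(logits - logits.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)


# Toy logged data: n instances, K actions (hypothetical values).
rng = np.random.default_rng(0)
n, K = 6, 5
logging_dist = softmax(rng.normal(size=(n, K)))
target_dist = softmax(rng.normal(size=(n, K)))
actions = np.array([rng.choice(K, p=logging_dist[i]) for i in range(n)])
rewards = rng.integers(0, 2, size=n).astype(float)

rows = np.arange(n)
full_ips = ips_estimate(
    rewards, target_dist[rows, actions], logging_dist[rows, actions]
)
restricted = selective_is_estimate(
    actions, rewards, target_dist, logging_dist, p=2
)
```

With p equal to the number of actions, the subset covers the whole action space and this sketch reduces to the vanilla estimator. With small p, the weights involve only ratios of renormalized probabilities over a few actions, which is the intuition behind trading some bias for lower variance.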
