Do Offline Metrics Predict Online Performance in Recommender Systems?

Recommender systems operate in an inherently dynamical setting. Past recommendations influence future behavior, including which data points are observed and how user preferences change. However, experimenting in production systems with real user dynamics is often infeasible, and existing simulation-based approaches have limited scale. As a result, many state-of-the-art algorithms are designed to solve supervised learning problems, and progress is judged only by offline metrics. In this work, we investigate the extent to which offline metrics predict online performance by evaluating eleven recommenders across six controlled simulated environments. We observe that offline metrics are correlated with online performance over a range of environments. However, improvements in offline metrics lead to diminishing returns in online performance. Furthermore, we observe that the ranking of recommenders varies depending on the amount of initial offline data available. We also study the impact of adding exploration strategies and observe that their effectiveness, compared to greedy recommendation, is highly dependent on the recommendation algorithm. We provide the environments and recommenders described in this paper as Reclab, an extensible, ready-to-use simulation framework.
