Dataset Shift in Machine Learning

Dataset shift is a common problem in predictive modeling that occurs when the joint distribution of inputs and outputs differs between the training and test stages. Covariate shift, a particular case of dataset shift, occurs when only the input distribution changes. Dataset shift is present in most practical applications, for reasons ranging from the bias introduced by experimental design to the irreproducibility of the testing conditions at training time. (An example is email spam filtering, which may fail to recognize spam that differs in form from the spam on which the automatic filter was built.) Despite this, and despite the attention given to the apparently similar problems of semi-supervised learning and active learning, dataset shift received relatively little attention in the machine learning community until recently.

This volume offers an overview of current efforts to deal with dataset and covariate shift. The chapters offer a mathematical and philosophical introduction to the problem; place dataset shift in relationship to transfer learning, transduction, local learning, active learning, and semi-supervised learning; provide theoretical views of dataset and covariate shift (including decision-theoretic and Bayesian perspectives); and present algorithms for covariate shift.

Contributors: Shai Ben-David, Steffen Bickel, Karsten Borgwardt, Michael Brückner, David Corfield, Amir Globerson, Arthur Gretton, Lars Kai Hansen, Matthias Hein, Jiayuan Huang, Takafumi Kanamori, Klaus-Robert Müller, Sam Roweis, Neil Rubens, Tobias Scheffer, Marcel Schmittfull, Bernhard Schölkopf, Hidetoshi Shimodaira, Alex Smola, Amos Storkey, Masashi Sugiyama, Choon Hui Teo

Neural Information Processing series
