Domain Adaptation: Learning Bounds and Algorithms

This paper addresses the general problem of domain adaptation, which arises in a variety of applications where the distribution of the available labeled sample differs somewhat from that of the test data. Building on previous work by Ben-David et al. (2007), we introduce a novel distance between distributions, the discrepancy distance, that is tailored to adaptation problems with arbitrary loss functions. We give Rademacher complexity bounds for estimating the discrepancy distance from finite samples for different loss functions. Using this distance, we derive new generalization bounds for domain adaptation for a wide family of loss functions. We also present a series of novel adaptation bounds for large classes of regularization-based algorithms, including support vector machines and kernel ridge regression, based on the empirical discrepancy. This motivates our analysis of the problem of minimizing the empirical discrepancy for various loss functions, for which we also give several algorithms. We report the results of preliminary experiments demonstrating the benefits of our discrepancy minimization algorithms for domain adaptation.
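To make the central quantity concrete, the following is a minimal illustrative sketch, not taken from the paper: it evaluates an empirical discrepancy between an unlabeled source and target sample as the largest gap, over pairs of hypotheses h, h' from a class H, between the average loss that h and h' incur against each other on the two samples. The tiny threshold-classifier class, the 0-1 loss, and all function and variable names are assumptions made for illustration; the paper's contribution includes algorithms that minimize this quantity, not merely evaluate it by brute force as done here.

```python
import numpy as np


def empirical_discrepancy(source_x, target_x, hypotheses, loss):
    """Largest gap, over pairs of hypotheses, between the average loss the
    pair incurs against each other on the source vs. the target sample."""
    best = 0.0
    for h in hypotheses:
        for h_prime in hypotheses:
            gap_source = np.mean(loss(h(source_x), h_prime(source_x)))
            gap_target = np.mean(loss(h(target_x), h_prime(target_x)))
            best = max(best, abs(gap_source - gap_target))
    return best


def zero_one_loss(y1, y2):
    # 0-1 loss between two label vectors.
    return (y1 != y2).astype(float)


# Toy example (hypothetical data): the target sample is shifted
# relative to the source sample, so the discrepancy should be large.
rng = np.random.default_rng(0)
source_x = rng.normal(0.0, 1.0, size=500)
target_x = rng.normal(1.5, 1.0, size=500)

# Small hypothetical hypothesis class: threshold classifiers on the line.
thresholds = np.linspace(-3.0, 4.0, 29)
hypotheses = [lambda x, t=t: (x > t).astype(float) for t in thresholds]

print(empirical_discrepancy(source_x, target_x, hypotheses, zero_one_loss))
```

In this toy setup the estimate is large because the two samples concentrate mass in different regions; in the paper, this is the quantity that enters the generalization bounds and that the proposed algorithms drive down by reweighting the source sample.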

[1] Leslie G. Valiant. A theory of the learnable. STOC, 1984.

[2] M. Overton. On minimizing the maximum eigenvalue of a symmetric matrix. 1988.

[3] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. 1991.

[4] Robert L. Mercer et al. Adaptive language modeling using minimum discriminant estimation. 1992.

[5] F. Jarre. An interior-point method for minimizing the maximum eigenvalue of a linear combination of matrices. 1993.

[6] Jean-Luc Gauvain and Chin-Hui Lee. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Transactions on Speech and Audio Processing, 1994.

[7] Christopher J. Leggetter and Philip C. Woodland. Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Computer Speech and Language, 1995.

[8] Farid Alizadeh. Interior Point Methods in Semidefinite Programming with Applications to Combinatorial Optimization. SIAM Journal on Optimization, 1995.

[9] Ronald Rosenfeld. A maximum entropy approach to adaptive statistical language modelling. Computer Speech and Language, 1996.

[10] Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition. Stochastic Modelling and Applied Probability, 1996.

[11] Frederick Jelinek. Statistical Methods for Speech Recognition. 1997.

[12] Vladimir Vapnik. Statistical Learning Theory. 1998.

[13] Craig Saunders, Alexander Gammerman, and Volodya Vovk. Ridge Regression Learning Algorithm in Dual Variables. ICML, 1998.

[14] Bernard Chazelle. The Discrepancy Method: Randomness and Complexity. 2000.

[15] V. Koltchinskii. Rademacher Processes and Bounding the Risk of Function Learning. arXiv:math/0405338, 2004.

[16] Christoph Helmberg and François Oustry. Bundle Methods to Minimize the Maximum Eigenvalue Function. 2000.

[17] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian Complexities: Risk Bounds and Structural Results. Journal of Machine Learning Research, 2002.

[18] Charles Elkan. The Foundations of Cost-Sensitive Learning. IJCAI, 2001.

[19] Gang Yu et al. A Min-Max-Sum Resource Allocation Problem and Its Applications. Operations Research, 2001.

[20] Olivier Bousquet and André Elisseeff. Stability and Generalization. Journal of Machine Learning Research, 2002.

[21] Aleix M. Martínez. Recognizing Imprecisely Localized, Partially Occluded, and Expression Variant Faces from a Single Sample per Class. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002.

[22] Brian Roark and Michiel Bacchiani. Supervised and unsupervised PCFG adaptation to novel domains. NAACL, 2003.

[23] Corinna Cortes and Vladimir Vapnik. Support-Vector Networks. Machine Learning, 1995.

[24] Ciprian Chelba and Alex Acero. Adaptation of Maximum Entropy Capitalizer: Little Data Can Help a Lot. Computer Speech and Language, 2006.

[25] Daniel Kifer, Shai Ben-David, and Johannes Gehrke. Detecting Change in Data Streams. VLDB, 2004.

[26] Hal Daumé III and Daniel Marcu. Domain Adaptation for Statistical Classifiers. Journal of Artificial Intelligence Research, 2006.

[27] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of Representations for Domain Adaptation. NIPS, 2006.

[28] Jing Jiang and ChengXiang Zhai. Instance Weighting for Domain Adaptation in NLP. ACL, 2007.

[29] John Blitzer, Mark Dredze, and Fernando Pereira. Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. ACL, 2007.

[30] John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman. Learning Bounds for Domain Adaptation. NIPS, 2007.

[31] Corinna Cortes, Mehryar Mohri, Michael Riley, and Afshin Rostamizadeh. Sample Selection Bias Correction Theory. ALT, 2008.

[32] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain Adaptation with Multiple Sources. NIPS, 2008.