Information estimators for weighted observations

The Shannon information content is a fundamental numerical characteristic of probability distributions, and estimating it from an observed dataset is an important problem in statistics, information theory, and machine learning. This paper proposes information estimators and demonstrates some of their applications. When data are associated with weights, each datum contributes differently to the empirical average of a statistic; the proposed estimators can handle such weighted data. Like conventional methods, the proposed information estimator contains a tuning parameter and is computationally expensive. To overcome these problems, the estimator is further modified so that it is more computationally efficient and has no tuning parameter. The proposed methods are also extended to estimate the cross-entropy, entropy, and Kullback-Leibler divergence. Simple numerical experiments show that the information estimators work properly. The estimators are then applied to two specific problems: distribution-preserving data compression and weight optimization for ensemble regression.
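To make the role of the weights concrete, the following is a minimal sketch of a generic weighted plug-in entropy estimate for one-dimensional data. It is not the estimator proposed in the paper: it assumes a Gaussian kernel density estimate with a fixed, user-chosen bandwidth, and the function name `weighted_kde_entropy` and its parameters are hypothetical. Each datum enters both the density estimate and the empirical average in proportion to its normalized weight.

```python
import numpy as np


def weighted_kde_entropy(x, w, bandwidth):
    """Weighted plug-in estimate of differential entropy for 1-D data.

    Illustrative assumption: a Gaussian kernel with a fixed bandwidth.
    This is a generic textbook-style sketch, not the paper's estimator.
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    w = w / w.sum()  # normalize weights so they sum to one

    # Pairwise scaled differences between all data points (n x n matrix).
    diffs = (x[:, None] - x[None, :]) / bandwidth
    kernel = np.exp(-0.5 * diffs**2) / (bandwidth * np.sqrt(2.0 * np.pi))

    # Weighted density estimate at each datum: p_hat(x_i) = sum_j w_j K_h(x_i - x_j).
    density = kernel @ w

    # Weighted empirical average of the information content -log p_hat(x_i).
    return -np.sum(w * np.log(density))


# Usage: uniform weights on samples from a standard normal distribution.
rng = np.random.default_rng(0)
samples = rng.standard_normal(2000)
estimate = weighted_kde_entropy(samples, np.ones_like(samples), bandwidth=0.3)
print(estimate)  # the true entropy is 0.5 * log(2 * pi * e), about 1.419
```

In this sketch the bandwidth plays the role of a tuning parameter, and the pairwise kernel matrix costs O(n^2) time and memory, which illustrates the kind of drawbacks the abstract attributes to conventional approaches.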
