Information Theoretic Learning

INTRODUCTION

Learning systems depend on three interrelated components: topologies, cost/performance functions, and learning algorithms. Topologies provide the constraints for the mapping, and learning algorithms offer the means to find an optimal solution; but optimal with respect to what? Optimality is characterized by the criterion, and in the neural network literature this is the least addressed component, even though it has a decisive influence on generalization performance. The assumptions behind the selection of a criterion therefore deserve to be better understood and investigated.

Traditionally, least squares has been the benchmark criterion for regression problems; treating classification as a regression problem of estimating class posterior probabilities, least squares has also been used to train neural networks and other classifier topologies to approximate correct labels. The main motivation for using least squares in regression is largely the intellectual comfort the criterion provides, owing to its success in traditional linear least squares regression, where the problem reduces to solving a system of linear equations. For nonlinear regression, the assumption of Gaussian measurement error combined with the maximum likelihood principle can be invoked to justify the criterion: under additive Gaussian noise, maximizing the likelihood is equivalent to minimizing the sum of squared errors. In nonparametric regression, the least squares principle leads to the conditional expectation as the optimal solution, which is intuitively appealing.

Although these are good reasons to adopt the mean squared error as the cost, the choice is inherently tied to the assumptions and habits stated above. Consequently, when one insists on second-order statistical criteria, information in the error signal is left uncaptured while training nonlinear adaptive systems under non-Gaussian conditions. The same argument extends to other linear, second-order techniques such as principal component analysis (PCA), linear discriminant analysis (LDA), and canonical correlation analysis (CCA). Recent work attempts to generalize these techniques to nonlinear scenarios through kernel methods or other heuristics. This raises the question: what alternative cost functions could be used to train adaptive systems, and how could we establish rigorous techniques for extending useful concepts from linear, second-order statistical techniques to nonlinear, higher-order statistical learning methodologies?
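To make the contrast between second-order and information-theoretic criteria concrete, the sketch below compares the mean squared error of an error signal with a nonparametric estimate of Renyi's quadratic entropy of that signal, computed through the quadratic information potential V(e) = (1/N^2) sum_i sum_j G_{sigma*sqrt(2)}(e_i - e_j), where G is a Gaussian kernel. This is a minimal illustration only, assuming a Gaussian Parzen window of width sigma; the function names and the heavy-tailed toy error signal are illustrative choices, not notation taken from the text.

```python
import numpy as np

def mse(errors):
    """Mean squared error: the classical second-order criterion."""
    return np.mean(errors ** 2)

def information_potential(errors, sigma=0.5):
    """Quadratic information potential V(e) = (1/N^2) sum_i sum_j G_{sigma*sqrt(2)}(e_i - e_j).

    Estimated with a Gaussian Parzen window; maximizing V is equivalent to
    minimizing Renyi's quadratic entropy of the error, so the criterion sees
    the whole error distribution rather than only its variance.
    """
    diffs = errors[:, None] - errors[None, :]      # all pairwise error differences
    s2 = 2.0 * sigma ** 2                          # variance of the kernel G_{sigma*sqrt(2)}
    kernel = np.exp(-diffs ** 2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2)
    return kernel.mean()                           # (1/N^2) * double sum

def renyi_quadratic_entropy(errors, sigma=0.5):
    """H_2(e) = -log V(e), a nonparametric entropy estimate of the error."""
    return -np.log(information_potential(errors, sigma))

# Toy comparison on a heavy-tailed (non-Gaussian) error signal.
rng = np.random.default_rng(0)
e = rng.standard_t(df=2, size=500) * 0.1
print("MSE:", mse(e))
print("Renyi quadratic entropy (Parzen):", renyi_quadratic_entropy(e))
```

Because the entropy estimate depends on all pairwise interactions among the error samples rather than only their second moment, it responds to higher-order structure in the error distribution, which is precisely the information that a second-order criterion discards under non-Gaussian conditions.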
