A New View of ICA

We present a new way of interpreting ICA as a probability density model and a new way of fitting this model to data. The advantage of our approach is that it suggests simple, novel extensions to overcomplete, undercomplete and multilayer non-linear versions of ICA.

Funded by the Wellcome Trust and the Gatsby Charitable Foundation.

1. ICA AS A CAUSAL GENERATIVE MODEL

Factor analysis is based on a causal generative model in which an observation vector is generated in three stages. First, the activities of the factors (also known as latent or hidden variables) are chosen independently from one-dimensional Gaussian priors. Next, these hidden activities are multiplied by a matrix of weights (the "factor loading" matrix) to produce a noise-free observation vector. Finally, independent Gaussian "sensor noise" is added to each component of the noise-free observation vector. Given an observation vector and a factor loading matrix, it is tractable to compute the posterior distribution of the hidden activities: it is Gaussian, although it generally has off-diagonal terms in the covariance matrix, so it is not as simple as the prior distribution over the hidden activities.

ICA can also be viewed as a causal generative model [1, 2] that differs from factor analysis in two ways. First, the priors over the hidden activities remain independent but they are non-Gaussian. By itself, this modification would make it intractable to compute the posterior distribution over the hidden activities. Second, tractability is restored by eliminating the sensor noise and by using the same number of factors as input dimensions, which ensures that the posterior distribution over the hidden activities collapses to a point.

Interpreting ICA as a type of causal generative model suggests a number of ways in which it might be generalized, for instance to deal with more hidden units than input dimensions. Most of these generalizations retain marginal independence of the hidden activities and add sensor noise, but fail to preserve the property that the posterior distribution collapses to a point. As a result, inference is intractable and crude approximations are needed to model the posterior distribution, e.g., a MAP estimate in [3], a Laplace approximation in [4, 5], or more sophisticated variational approximations in [6].

2. ICA AS AN ENERGY-BASED DENSITY MODEL

We now describe a very different way of interpreting ICA as a probability density model; in the next section we describe how we can fit the model to data. The advantage of our energy-based view is that it suggests different generalizations of the basic ICA algorithm which preserve the computationally attractive property that the hidden activities are a simple deterministic function of the observed data. Instead of viewing the hidden factors as stochastic latent variables in a causal generative model, we view them as deterministic functions of the data with parameters θ. The hidden factors are then used to assign an energy, E(x), to each possible observation vector x.
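The display equation that originally followed this sentence is missing from the extraction. As a sketch of the standard energy-based formulation implied by the surrounding text (the per-factor penalties g_i, the filter outputs h_i, and the parameters θ are our notation, not necessarily the paper's), the energy and the corresponding density take the form

    E(\mathbf{x};\,\theta) \;=\; \sum_i g_i\!\big(h_i(\mathbf{x};\,\theta)\big),
    \qquad
    p(\mathbf{x}\,|\,\theta) \;=\; \frac{\exp\{-E(\mathbf{x};\,\theta)\}}{\int \exp\{-E(\mathbf{y};\,\theta)\}\,d\mathbf{y}}.

For square, noiseless ICA, choosing h_i(x) = w_i^T x and g_i = -log p_i recovers the usual ICA density, with the normalizing integral equal to 1/|det W|; the appeal of the energy-based form, as the abstract notes, is that it remains well defined when the hidden factors are overcomplete, undercomplete or non-linear.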

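To make the contrast between the two views concrete, the following NumPy sketch is our own illustration, not code from the paper: it samples observations from a square, noiseless causal ICA model with independent Laplacian priors, and then evaluates the energy-based unnormalized log-density using the deterministic filter outputs h = W x with W = A^{-1}. The function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_causal_ica(A, n_samples, rng):
    """Causal generative view: draw independent non-Gaussian (here Laplacian)
    source activities and mix them with the square matrix A (no sensor noise)."""
    d = A.shape[1]
    s = rng.laplace(size=(n_samples, d))   # independent non-Gaussian priors
    return s @ A.T                         # noise-free observations x = A s

def energy(X, W):
    """Energy-based view: the hidden factors are deterministic filter outputs
    h = W x, and each contributes a penalty -log p_i(h_i); for a unit Laplacian
    prior this is |h_i| up to an additive constant."""
    H = X @ W.T
    return np.sum(np.abs(H), axis=1)

def unnormalized_log_density(X, W):
    """log p(x) up to the data-independent log partition function."""
    return -energy(X, W)

# Toy example: two sources, a random square mixing matrix, and the exact filters.
A = rng.normal(size=(2, 2))
W = np.linalg.inv(A)            # with the true filters, h recovers the sources
X = sample_causal_ica(A, 5, rng)
print(unnormalized_log_density(X, W))
```

With Laplacian priors the per-factor penalty is simply |h_i|, which is why the energy reduces to an L1 norm on the filter outputs in this sketch.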
REFERENCES

[1] A. J. Bell and T. J. Sejnowski, "An Information-Maximization Approach to Blind Separation and Blind Deconvolution," Neural Computation, 1995.

[2] S. Amari, A. Cichocki, and H. H. Yang, "A New Learning Algorithm for Blind Signal Separation," NIPS, 1995.

[3] B. A. Pearlmutter and L. C. Parra, "A Context-Sensitive Generalization of ICA," 1996.

[4] B. A. Olshausen and D. J. Field, "Sparse coding with an overcomplete basis set: A strategy employed by V1?," Vision Research, 1997.

[5] S. Della Pietra, V. Della Pietra, and J. D. Lafferty, "Inducing Features of Random Fields," IEEE Trans. Pattern Anal. Mach. Intell., 1997.

[6] S. C. Zhu, Y. N. Wu, and D. Mumford, "Minimax Entropy Principle and Its Application to Texture Modeling," Neural Computation, 1997.

[7] J. H. van Hateren and A. van der Schaaf, "Independent component filters of natural images compared with simple cells in primary visual cortex," Proceedings of the Royal Society of London, Series B: Biological Sciences, 1998.

[8] H. Attias, "Independent Factor Analysis," Neural Computation, 1999.

[9] M. S. Lewicki and B. A. Olshausen, "Probabilistic framework for the adaptation and comparison of image codes," Journal of the Optical Society of America A, 1999.

[10] M. S. Lewicki and T. J. Sejnowski, "Learning Overcomplete Representations," Neural Computation, 2000.

[11] G. E. Hinton and Y. W. Teh, "Discovering Multiple Constraints that are Frequently Approximately Satisfied," UAI, 2001.

[12] A. Hyvärinen and P. O. Hoyer, "A two-layer sparse coding model learns simple and complex cell receptive fields and topography from natural images," Vision Research, 2001.

[13] M. D. Plumbley et al., "If the independent components of natural images are edges, what are the independent components of natural sounds?," 2001.

[14] G. E. Hinton, "Training Products of Experts by Minimizing Contrastive Divergence," Neural Computation, 2002.