Principal Component Analysis and Effective K-Means Clustering

The widely adopted K-means clustering algorithm minimizes a sum-of-squared-errors objective function. A detailed analysis reveals the close relationship between K-means clustering and principal component analysis (PCA), which is extensively used for unsupervised dimension reduction. We prove that the continuous solutions of the discrete K-means cluster membership indicators are the projections of the data onto the principal directions (the principal eigenvectors of the covariance matrix). New lower bounds for the K-means objective function are derived, which relate directly to the eigenvalues of the covariance matrix. Experiments on Internet newsgroups indicate that the new bounds are within 0.5-1.5% of the optimal values, and that PCA provides an effective solution to the K-means clustering problem.

1 Principal Component Analysis

Principal component analysis (PCA) [5] in multivariate statistics is widely adopted as an effective unsupervised dimension reduction method and has been extended in many directions. The main justification for the dimension reduction is that PCA uses the singular value decomposition (SVD), which by the Eckart-Young theorem gives the best low-rank approximation to the original data in the $L_2$ norm. However, this essentially noise-reduction perspective alone is inadequate to explain the effectiveness of PCA. In this paper, we provide a new perspective on PCA based on its close relationship with the K-means clustering algorithm: we show that the principal components are in fact relaxed cluster membership indicators.

Some background on PCA. The original $n$ data points in $m$-dimensional space are contained in the data matrix $X = (x_1, \cdots, x_n)$. In general the data are not centered around the origin, so we define the centered data matrix $Y = (y_1, \cdots, y_n)$, where $y_i = x_i - \bar{x}$ and $\bar{x} = \sum_i x_i / n$. The covariance matrix is
$$ S = \sum_i (x_i - \bar{x})(x_i - \bar{x})^T = Y Y^T. $$
The principal eigenvectors $u_k$ of $Y Y^T$ are the principal directions of the data $Y$. The principal eigenvectors $v_k$ of the Gram matrix $Y^T Y$ are the principal components; the entries of each $v_k$ are the projected values of the data points on the principal direction $u_k$. The two are related via
$$ v_k = Y^T u_k / \lambda_k^{1/2}, $$
where $\lambda_k$ is the $k$-th eigenvalue of the covariance matrix $Y Y^T$.

2 K-means clustering

The popular K-means algorithm [3] is an error-minimization algorithm whose objective function is the sum of squared errors,
$$ J_K = \sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - m_k \rVert^2, $$
where $m_k = \sum_{i \in C_k} x_i / n_k$ is the centroid of cluster $C_k$ containing $n_k$ data points.
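To make the preceding definitions concrete, here is a minimal numerical sketch (NumPy; the synthetic data and all variable names are ours, not the paper's) verifying the Section 1 relation $v_k = Y^T u_k / \lambda_k^{1/2}$: the principal components obtained directly from the Gram matrix agree, up to sign, with those reconstructed from the principal directions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: m = 3 dimensions, n = 50 points stored as columns.
X = rng.normal(size=(3, 50)) * np.array([[3.0], [1.0], [0.3]])
Y = X - X.mean(axis=1, keepdims=True)      # centered data matrix Y

# Principal directions u_k: eigenvectors of the covariance matrix Y Y^T.
lam, U = np.linalg.eigh(Y @ Y.T)
lam, U = lam[::-1], U[:, ::-1]             # sort eigenvalues in decreasing order

# Principal components v_k: eigenvectors of the Gram matrix Y^T Y.
V = np.linalg.eigh(Y.T @ Y)[1][:, ::-1]

# Check v_k = Y^T u_k / lambda_k^{1/2} for the leading component
# (eigenvectors are defined only up to sign, so fix the sign first).
v_from_u = Y.T @ U[:, 0] / np.sqrt(lam[0])
sign = np.sign(v_from_u @ V[:, 0])
print(np.allclose(v_from_u, sign * V[:, 0]))   # True: the two constructions agree
```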
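Similarly, the eigenvalue-based lower bounds announced in the abstract can be probed numerically. The sketch below assumes the bound takes the form $J_K \ge \mathrm{tr}(S) - \sum_{k=1}^{K-1} \lambda_k$ (the form relating $J_K$ to the covariance eigenvalues that the paper derives later; treated here as an assumption) and compares it with the objective reached by a plain Lloyd-style K-means on synthetic Gaussian blobs.

```python
import numpy as np

def kmeans_sse(Y, K, iters=100, seed=0):
    """Plain Lloyd-style K-means on the columns of Y; returns the objective J_K."""
    rng = np.random.default_rng(seed)
    centers = Y[:, rng.choice(Y.shape[1], K, replace=False)]
    for _ in range(iters):
        d = ((Y[:, None, :] - centers[:, :, None]) ** 2).sum(axis=0)  # K x n squared distances
        labels = d.argmin(axis=0)
        for k in range(K):
            if (labels == k).any():                       # guard against empty clusters
                centers[:, k] = Y[:, labels == k].mean(axis=1)
    return sum(((Y[:, labels == k] - centers[:, [k]]) ** 2).sum() for k in range(K))

rng = np.random.default_rng(1)
# Three well-separated Gaussian blobs in 5-D, stacked as columns, then centered.
blobs = [np.array(c).reshape(-1, 1) + 0.3 * rng.normal(size=(5, 40))
         for c in ([0, 0, 0, 0, 0], [4, 0, 0, 0, 0], [0, 4, 0, 0, 0])]
Y = np.hstack(blobs)
Y = Y - Y.mean(axis=1, keepdims=True)

K = 3
lam = np.linalg.eigvalsh(Y @ Y.T)[::-1]          # eigenvalues of S = Y Y^T, decreasing
bound = np.trace(Y @ Y.T) - lam[:K - 1].sum()    # assumed bound: tr(S) - sum of top K-1 eigenvalues
print(kmeans_sse(Y, K), ">=", bound)             # J_K should never fall below the bound
```

Whatever the initialization, the objective reached by Lloyd's iterations stays at or above this eigenvalue-based quantity, which is what makes it usable as a lower bound on the optimal $J_K$.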