Classification methods for high-dimensional sparse data

Estimation of predictive classification models from high-dimensional, low sample size (HDLSS) data is becoming increasingly important in applications such as gene microarray analysis, image-based object recognition, and functional magnetic resonance imaging (fMRI) analysis. In these applications the dimensionality of the data vectors greatly exceeds the sample size, and such sparse data sets present new challenges for classification learning methods. Commonly used algorithms include (a) dimensionality-reduction methods such as Linear Discriminant Analysis (LDA) and (b) margin-based methods such as the Support Vector Machine (SVM). Both approaches attempt to control model complexity, though in different ways. Even though SVM and LDA were introduced as general-purpose methodologies, their performance varies greatly depending on the statistical characteristics of the available data.

To gain a better understanding of these techniques, we analyze the properties of SVM and LDA classifiers applied to HDLSS data. We show that tuning the regularization parameter in Regularized LDA (RLDA) can alleviate the data-piling phenomenon, which provides one explanation of why regularization improves the performance of LDA on HDLSS data. We then propose a very efficient algorithm for tuning the regularization parameter of RLDA.

For SVM, we show that when the regularization parameter C is larger than a threshold (which can be computed explicitly), SVM classifiers perform similarly on HDLSS data regardless of C. This result provides guidelines for the practical application of SVM to real HDLSS data.

Another principled approach is to adopt new learning formulations when dealing with HDLSS data. Multi-task learning (MTL) has recently been introduced to the machine learning literature. We propose a novel framework that embeds joint feature selection into the multi-task learning process. The benefits of the proposed method are twofold. On the one hand, it compensates for the small sample size of the task at hand by using additional samples from related tasks, thus fully exploiting the benefits offered by multi-task learning. On the other hand, the feature selection mechanism reduces the essential dimensionality of the data, which can further improve generalization performance.
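To make the RLDA tuning step concrete, the following is a minimal sketch, not the efficient tuning algorithm proposed above: it selects the regularization (shrinkage) parameter by plain cross-validated grid search on synthetic HDLSS data, with scikit-learn's shrinkage-based LDA standing in for RLDA. The data dimensions, signal pattern, and parameter grid are all illustrative assumptions.

```python
# Hedged sketch: cross-validated tuning of the RLDA regularization
# (shrinkage) parameter. scikit-learn's shrinkage LDA is a stand-in for
# RLDA; the synthetic data below is illustrative only.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
n, d = 40, 1000                          # HDLSS regime: n << d
X = rng.standard_normal((n, d))
y = np.repeat([0, 1], n // 2)
X[y == 1, :10] += 1.0                    # weak signal in the first 10 features

# The 'lsqr' solver supports shrinkage regularization of the covariance.
grid = GridSearchCV(
    LinearDiscriminantAnalysis(solver="lsqr"),
    param_grid={"shrinkage": np.linspace(0.01, 0.99, 25)},
    cv=5,
)
grid.fit(X, y)
print("best shrinkage:", grid.best_params_["shrinkage"])
```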
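The C-threshold behavior of SVM can also be checked empirically. The sketch below, assuming separable synthetic HDLSS data (the explicit threshold formula from the analysis is not reproduced here), trains linear SVMs over a range of C values and shows that the normalized weight vectors essentially stop changing once C is large enough.

```python
# Hedged sketch: on separable HDLSS data, the linear-SVM solution stops
# changing once C exceeds a data-dependent threshold; normalized weight
# vectors for large C nearly coincide. Synthetic data, illustrative only.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n, d = 40, 1000                          # n << d, almost surely separable
X = rng.standard_normal((n, d))
y = np.repeat([-1, 1], n // 2)
X[y == 1, :10] += 1.0

Cs = [0.1, 1.0, 10.0, 100.0, 1000.0]
ws = []
for C in Cs:
    w = SVC(kernel="linear", C=C).fit(X, y).coef_.ravel()
    ws.append(w / np.linalg.norm(w))

# Cosine similarity between successive solutions approaches 1 for large C.
for i in range(1, len(Cs)):
    print(f"C={Cs[i]:>7}: cosine similarity to previous = {ws[i-1] @ ws[i]:.6f}")
```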
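For the joint feature selection idea, a common minimal stand-in is an l2,1-regularized (group-sparse) multi-task model, in which each feature is kept or dropped jointly across all tasks. The sketch below uses scikit-learn's MultiTaskLasso as a hedged proxy for the proposed framework: it is a regression surrogate rather than the classification formulation described above, the tasks here share a single design matrix for simplicity, and all data and the alpha value are synthetic assumptions.

```python
# Hedged sketch: joint feature selection across related tasks via an
# l2,1 (group-sparse) penalty. MultiTaskLasso is only a proxy for the
# proposed MTL feature-selection framework; data are synthetic.
import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.default_rng(0)
n, d, T = 30, 200, 3                     # few samples per task, T tasks
W_true = np.zeros((d, T))
W_true[:5] = rng.standard_normal((5, T))  # 5 features shared by all tasks
X = rng.standard_normal((n, d))          # simplification: tasks share X
Y = X @ W_true + 0.1 * rng.standard_normal((n, T))

mtl = MultiTaskLasso(alpha=0.5).fit(X, Y)
# A feature is selected iff its coefficient group (one row per task)
# is nonzero, i.e. it is selected jointly for all tasks.
selected = np.flatnonzero(np.linalg.norm(mtl.coef_, axis=0) > 1e-8)
print("jointly selected features:", selected)
```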