Cross-language web page classification via dual knowledge transfer using nonnegative matrix tri-factorization

The lack of sufficient labeled Web pages in many languages, especially less commonly used ones, makes it difficult for traditional supervised classification methods to achieve satisfactory Web page classification performance. To address this, we propose a novel Nonnegative Matrix Tri-factorization (NMTF) based Dual Knowledge Transfer (DKT) approach for cross-language Web page classification, which rests on two observations. First, Web pages on the same topic in different languages usually share common semantic patterns, though in different representation forms. Second, the associations between word clusters and Web page classes are a more reliable carrier than raw words for transferring knowledge across languages. Building on these observations, we transfer knowledge from the auxiliary language, in which abundant labeled Web pages are available, to the target languages, in which we want to classify Web pages, through two paths: word cluster approximations and the associations between word clusters and Web page classes. Because these two knowledge transfer paths reinforce each other, our approach can achieve better classification accuracy. We evaluate the proposed approach in extensive experiments on a real-world cross-language Web page data set. Promising results demonstrate the effectiveness of our approach and are consistent with our theoretical analyses.
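For readers unfamiliar with the factorization underlying the approach, the following is a minimal sketch of plain NMTF, which approximates a nonnegative word-by-page matrix X as F S G^T, where F holds word clusters, G holds page classes, and S holds the associations between them. It assumes the standard multiplicative updates for the squared Frobenius loss; the transfer terms that couple the auxiliary and target languages in DKT are not reproduced here, since the abstract does not specify them. The function name `nmtf`, its parameters, and the toy matrix are illustrative assumptions, not part of the original work.

```python
import numpy as np


def nmtf(X, k_words, k_classes, n_iter=200, eps=1e-9, seed=0):
    """Approximate nonnegative X (words x pages) as F @ S @ G.T.

    F: words x word clusters, S: word clusters x classes,
    G: pages x classes. All factors stay nonnegative.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    # Random positive initialization (an assumption of this sketch).
    F = rng.random((m, k_words)) + eps
    S = rng.random((k_words, k_classes)) + eps
    G = rng.random((n, k_classes)) + eps
    errors = []
    for _ in range(n_iter):
        # Multiplicative updates: each factor is rescaled elementwise,
        # so nonnegativity is preserved without explicit projection.
        F *= (X @ G @ S.T) / (F @ (S @ (G.T @ G) @ S.T) + eps)
        G *= (X.T @ F @ S) / (G @ (S.T @ (F.T @ F) @ S) + eps)
        S *= (F.T @ X @ G) / ((F.T @ F) @ S @ (G.T @ G) + eps)
        errors.append(float(np.linalg.norm(X - F @ S @ G.T) ** 2))
    return F, S, G, errors


# Toy word-by-page count matrix: 5 words, 4 pages (made-up data).
X = np.array(
    [
        [3, 2, 0, 0],
        [2, 3, 0, 0],
        [0, 0, 4, 1],
        [0, 0, 1, 4],
        [1, 1, 1, 1],
    ],
    dtype=float,
)

F, S, G, errors = nmtf(X, k_words=2, k_classes=2)
page_labels = G.argmax(axis=1)  # hard class assignment per page
```

In a transfer setting, one plausible use of such a factorization is to learn S (and an approximation of F) on the labeled auxiliary language and carry them over when factorizing the target-language matrix, which corresponds to the two transfer paths described above.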
