Web document clustering using hyperlink structures

Xiaofeng He†, Hongyuan Zha*, Chris H.Q. Ding† and Horst D. Simon†

Abstract. With the exponential growth of information on the World Wide Web, there is great demand for efficient and effective methods for organizing and retrieving the information available. Document clustering plays an important role in information retrieval and taxonomy management for the World Wide Web, and it remains an interesting and challenging problem in the field of web computing. In this paper we consider document clustering methods that exploit textual information, hyperlink structure and co-citation relations. In particular, we apply the normalized-cut clustering method developed in computer vision to the task of hyperdocument clustering. We also explore some theoretical connections between the normalized-cut method and the K-means method. We then experiment with the normalized-cut method in the context of clustering query result sets for web search engines.

Keywords. World Wide Web, graph partitioning, Cheeger constant, clustering method, K-means method, normalized cut method, eigenvalue decomposition, power method.

1. Introduction. Currently the World Wide Web contains billions of documents, and it is still growing rapidly. Finding the relevant documents to satisfy a user's information need is a very important and challenging task. Many commercial search engines have been developed and are used by millions of people all over the world. However, the relevance of the documents returned in search engine result sets is still lacking, and further research and development are needed to make search engines a truly ubiquitous information-seeking tool. The World Wide Web has a rich structure: it contains both textual web documents and the hyperlinks that connect them. The web documents and the hyperlinks between them form a directed graph in which the web documents can be viewed as vertices and the hyperlinks as directed edges.
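As a minimal sketch of this graph view, the Python snippet below stores a small set of pages as a 0/1 adjacency matrix and derives co-citation counts from it. The page names and links are hypothetical, invented only for illustration. The co-citation count of two pages is assumed here to follow the standard bibliometric definition: the number of pages that link to both.

```python
# Hypothetical toy web: each key links to the pages in its list.
links = {
    "a": ["c", "d"],
    "b": ["c", "d"],
    "c": ["d"],
    "d": [],
}
pages = sorted(links)
index = {p: i for i, p in enumerate(pages)}
n = len(pages)

# Adjacency matrix of the directed graph: A[i][j] = 1 if page i links to page j.
A = [[0] * n for _ in range(n)]
for src, targets in links.items():
    for dst in targets:
        A[index[src]][index[dst]] = 1


def cocitation(A):
    """Return C with C[i][j] = number of pages linking to both i and j.

    In matrix terms this is the product A^T A.
    """
    n = len(A)
    return [[sum(A[k][i] * A[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


C = cocitation(A)
```

In this toy graph, pages `c` and `d` are co-cited twice (by `a` and by `b`), and each diagonal entry of `C` equals the in-degree of the corresponding page. The resulting matrix is symmetric, so it could serve as one ingredient of a similarity measure between documents.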
Algorithms have been developed that use this directed graph to extract the information contained in a collection of hyperlinked web documents. Kleinberg proposed the HITS algorithm, which relies purely on hyperlink information to retrieve the most relevant information for a user query: authority and hub documents [20]. However, if the hypertext collection covers several topics, the authority and hub documents may represent only the most popular topics and leave out the less popular ones. One way to remedy this situation is to first partition the hypertext collection into topical groups and then present the search results to the user as a list of topics. This leads to the need to cluster web documents based on both textual and hyperlink information.

There exists a large literature on clustering methods and algorithms [13, 19]. Generally speaking, the purpose of cluster analysis is to organize the data into meaningful groups: the data objects in the same group are highly similar, and those in different groups are dissimilar. Judging the effectiveness of a clustering algorithm is difficult and usually application-dependent. In this paper, we apply a similarity-based clustering method to the problem of clustering web documents. It utilizes a graph-theoretic criterion called normalized cut, which has its roots in the study of graph isoperimetric

* Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, {xhe,zha}@cse.psu.edu. This work was supported in part by NSF grant CCR-9901986.

† NERSC Division, Lawrence Berkeley National Laboratory, University of California, Berkeley, CA 94720, {xfhe,chqding,hdsimon}@lbl.gov. Supported by Department of Energy through an LBL LDRD fund.
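The introduction names the normalized-cut criterion without stating it in this excerpt. As an illustration only, the sketch below assumes the usual definition from the image-segmentation literature: for a symmetric, nonnegative similarity matrix W and a bipartition (S, T) of the vertex set V, Ncut(S, T) = cut(S, T)/assoc(S, V) + cut(S, T)/assoc(T, V), where cut(S, T) sums the weights between S and T and assoc(S, V) sums the weights from S to all vertices. The matrix `W` and the helper names are hypothetical, and the exhaustive search is a stand-in that is feasible only for tiny graphs; it is not the eigenvector-based procedure the paper applies.

```python
from itertools import combinations


def ncut(W, S):
    """Normalized cut of the bipartition (S, V minus S) for a symmetric matrix W.

    Assumes every vertex has at least one positive-weight edge, so that
    neither association term is zero.
    """
    n = len(W)
    S = set(S)
    T = set(range(n)) - S
    cut = sum(W[i][j] for i in S for j in T)
    assoc_S = sum(W[i][j] for i in S for j in range(n))
    assoc_T = sum(W[i][j] for i in T for j in range(n))
    return cut / assoc_S + cut / assoc_T


def best_bipartition(W):
    """Exhaustively search all bipartitions for the smallest normalized cut."""
    n = len(W)
    best = None
    # Sizes up to n // 2 suffice because (S, T) and (T, S) have the same value.
    for size in range(1, n // 2 + 1):
        for S in combinations(range(n), size):
            value = ncut(W, S)
            if best is None or value < best[0]:
                best = (value, set(S))
    return best


# Hypothetical similarity matrix: two tightly linked triples of documents
# (0, 1, 2) and (3, 4, 5) joined by one weak link between documents 2 and 3.
W = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0.1, 0, 0],
    [0, 0, 0.1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]

value, group = best_bipartition(W)
```

On this toy matrix the minimizing bipartition separates the two triples, cutting only the weak link. Dividing by the association terms is what penalizes lopsided splits: cutting off a single document here removes weight 2 from a group whose total association is also about 2, so its normalized cut exceeds 1.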

[1] Henry G. Small, et al. Co-citation in the scientific literature: A new measure of the relationship between two documents, 1973, J. Am. Soc. Inf. Sci.

[2] Ray R. Larson, et al. Bibliometrics of the World Wide Web: An Exploratory Analysis of the Intellectual Structure of Cyberspace, 1996.

[3] Andrei Z. Broder, et al. A Technique for Measuring the Relative Size and Overlap of Public Web Search Engines, 1998, Comput. Networks.

[4] Gene H. Golub, et al. Matrix computations, 1983.

[5] Peter G. Anick. Adapting a full-text information retrieval system to the computer troubleshooting domain, 1994, SIGIR '94.

[6] Shang-Hua Teng, et al. Spectral partitioning works: planar graphs and finite element meshes, 1996, Proceedings of 37th Conference on Foundations of Computer Science.

[7] Prabhakar Raghavan, et al. Mining the Link Structure of the World Wide Web, 1998.

[8] Bruce Hendrickson, et al. An Improved Spectral Graph Partitioning Algorithm for Mapping Parallel Computations, 1995, SIAM J. Sci. Comput.

[9] Jitendra Malik, et al. Normalized cuts and image segmentation, 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[10] Jon M. Kleinberg, et al. Mining the Web's Link Structure, 1999, Computer.

[11] C. Lee Giles, et al. Efficient identification of Web communities, 2000, KDD '00.

[12] Marti A. Hearst, et al. Reexamining the cluster hypothesis: scatter/gather on retrieval results, 1996, SIGIR '96.

[13] Oren Etzioni, et al. Grouper: A Dynamic Clustering Interface to Web Search Results, 1999, Comput. Networks.

[14] M. Fiedler. A property of eigenvectors of nonnegative symmetric matrices and its application to graph theory, 1975.

[15] Efthimis N. Efthimiadis, et al. A user-centred evaluation of ranking algorithms for interactive query expansion, 1993, SIGIR.

[16] Bojan Mohar, et al. Laplace eigenvalues of graphs - a survey, 1992, Discret. Math.

[17] Yanhong Li. Toward A Qualitative Search Engine, 1998, IEEE Internet Comput.

[18] J. Cheeger. A lower bound for the smallest eigenvalue of the Laplacian, 1969.

[19] Ramana Rao, et al. Silk from a sow's ear: extracting usable structures from the Web, 1996, CHI.

[20] W. Bruce Croft, et al. Providing Government Information on the Internet: Experiences with THOMAS, 1995, DL.

[21] M. Fiedler. Algebraic connectivity of graphs, 1973.

[22] Alex Pothen, et al. Partitioning sparse matrices with eigenvectors of graphs, 1990.

[23] Alexander Dekhtyar, et al. Information Retrieval, 2018, Lecture Notes in Computer Science.

[24] Alan M. Frieze, et al. Fast Monte-Carlo algorithms for finding low-rank approximations, 1998, Proceedings 39th Annual Symposium on Foundations of Computer Science.

[25] H. D. Simon, et al. A spectral algorithm for envelope reduction of sparse matrices, 1993, Supercomputing '93. Proceedings.

[26] Ravi Kumar, et al. Extracting Large-Scale Knowledge Bases from the Web, 1999, VLDB.

[27] M. F. Porter, et al. An algorithm for suffix stripping, 1997.

[28] Prabhakar Raghavan, et al. Sparse matrix reordering schemes for browsing hypertext, 1996.

[29] Jon M. Kleinberg, et al. The Web as a Graph: Measurements, Models, and Methods, 1999, COCOON.

[30] Peter Willett, et al. Recent trends in hierarchic document clustering: A critical review, 1988, Inf. Process. Manag.

[31] P. Sopp. Cluster analysis, 1996, Veterinary immunology and immunopathology.

[32] Jon M. Kleinberg, et al. Inferring Web communities from link topology, 1998, HYPERTEXT '98.

[33] Jon M. Kleinberg, et al. Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text, 1998, Comput. Networks.

[34] Dirk Roose, et al. An Improved Spectral Bisection Algorithm and its Application to Dynamic Load Balancing, 1995, EUROSIM International Conference.