Methods for Mining Web Communities: Bibliometric, Spectral, and Flow

In this chapter, we examine the problem of Web community identification, expressed in terms of the graph or network structure induced by the Web. While community identification is closely related to the more fundamental problems of graph partitioning and clustering, it differs from them in being set within the Web domain. This single difference has many implications for how effective methods work, both in theory and in practice. In order of presentation, we examine bibliometric similarity measures, bipartite community cores, the HITS algorithm, PageRank, and maximum flow-based Web communities. Interestingly, these topics relate to one another in nontrivial ways.
