Efficient identification of Web communities

We define a community on the web as a set of sites that have more links (in either direction) to members of the community than to non-members. Members of such a community can be efficiently identified in a maximum flow / minimum cut framework, where the source is composed of known members, and the sink consists of well-known non-members. A focused crawler that crawls to a fixed depth can approximate community membership by augmenting the graph induced by the crawl with links to a virtual sink node. The effectiveness of the approximation algorithm is demonstrated with several crawl results that identify hubs, authorities, web rings, and other link topologies that are useful but not easily categorized. Applications of our approach include focused crawlers and search engines, automatic population of portal categories, and improved filtering.
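The framework described above can be illustrated with a minimal sketch. The abstract does not specify capacities or a particular flow algorithm, so everything concrete here is an assumption made for illustration: links are treated as undirected edges whose capacity equals the number of seeds, every non-seed page is tied to a virtual sink with capacity 1, and a plain Edmonds-Karp search stands in for whatever maximum flow routine is actually used. The function name `find_community` and the sample page names are hypothetical.

```python
from collections import defaultdict, deque


def find_community(links, seeds, edge_cap=None):
    """Return the pages on the source side of a minimum cut.

    links    -- iterable of (page, page) pairs; direction is ignored
    seeds    -- known community members (attached to a virtual source)
    edge_cap -- capacity of each link; defaults to len(seeds) (an assumption)
    """
    seeds = set(seeds)
    if edge_cap is None:
        edge_cap = len(seeds)
    SOURCE, SINK = object(), object()  # virtual nodes, never equal to a page

    # Deduplicate links and drop self-loops; each link counts once.
    undirected = {frozenset((u, v)) for u, v in links if u != v}

    cap = defaultdict(lambda: defaultdict(int))  # residual capacities
    nodes = set()
    for pair in undirected:
        u, v = pair
        cap[u][v] = edge_cap
        cap[v][u] = edge_cap
        nodes.update(pair)
    for s in seeds:
        cap[SOURCE][s] = float("inf")  # seeds can never be cut away
        cap[s][SOURCE] += 0
    for n in nodes - seeds:
        cap[n][SINK] = 1  # assumed unit-capacity link to the virtual sink
        cap[SINK][n] += 0

    def reachable():
        """Breadth-first search over edges with spare residual capacity."""
        parent = {SOURCE: None}
        queue = deque([SOURCE])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    while True:
        parent = reachable()
        if SINK not in parent:
            break  # no augmenting path left: the flow is maximal
        # Find the bottleneck along the path, then push that much flow.
        bottleneck, v = float("inf"), SINK
        while parent[v] is not None:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = SINK
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u

    # Pages still reachable from the source lie inside the minimum cut.
    return set(parent) - {SOURCE}


# Two fully linked groups of four pages joined by a single link.
group_a = ["a1", "a2", "a3", "a4"]
group_b = ["b1", "b2", "b3", "b4"]
links = [(x, y) for g in (group_a, group_b) for x in g for y in g if x < y]
links.append(("a4", "b1"))

print(sorted(find_community(links, {"a1", "a2"})))
```

In this toy graph the cheapest cut severs the single bridge plus the sink links of `a3` and `a4`, so the two seeds pull in the rest of their densely linked group and leave the other group out. With a very small seed set or low link capacities, the cut can collapse to the seeds alone, so the capacity choices above should be read as tunable assumptions rather than fixed parts of the method.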
