A spectral method to separate disconnected and nearly-disconnected web graph components

Connected components of a graph are usually separated with breadth-first search (BFS) or depth-first search (DFS) graph algorithms. Here we propose a new algebraic method to separate disconnected and nearly-disconnected components. The method is based on spectral graph partitioning and follows from a key observation: after proper sorting, disconnected components show up as step-function-like curves in the lowest eigenvectors of the Laplacian matrix of the graph. Following a perturbative analysis framework, we systematically analyze the graph structure, first for the case of disconnected subgraphs, and second for the effect of adding edges that sparsely connect different subgraphs, treated as a perturbation. Several new results are derived, providing insights into spectral methods and the related clustering objective functions. Examples are given to illustrate the concepts and the results of our method. Compared with the standard graph algorithms, this method has the same O(|E| + |V| log |V|) complexity but is easier to implement, since it uses readily available eigensolvers. Furthermore, the method can easily identify articulation points and bridges in nearly-disconnected graphs. We present the segmentation of a real Web graph obtained for the query "amazon" and find that each disconnected or nearly-disconnected component forms a cluster on a clear topic.
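The key observation for the exactly disconnected case can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `spectral_components`, the tolerances, and the use of NumPy's dense `eigh` solver are assumptions made for illustration (a sparse eigensolver would be needed to approach the stated complexity). The sketch relies on a standard fact: the number of zero eigenvalues of the graph Laplacian equals the number of connected components, and every eigenvector in that null space is constant on each component, so nodes whose null-space coordinates coincide belong to the same component.

```python
import numpy as np

def spectral_components(adj, eig_tol=1e-8, row_tol=1e-6):
    """Label connected components from the Laplacian null space.

    adj is a symmetric 0/1 adjacency matrix of an undirected graph.
    Hypothetical helper for illustration; the tolerances are assumptions.
    """
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    # Combinatorial Laplacian L = D - A.
    lap = np.diag(adj.sum(axis=1)) - adj
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    vals, vecs = np.linalg.eigh(lap)
    # Count (numerically) zero eigenvalues: one per connected component.
    k = int(np.sum(vals < eig_tol))
    null = vecs[:, :k]
    # Rows of the null-space basis are identical within a component and
    # distinct across components, so group nodes with matching rows.
    labels = -np.ones(n, dtype=int)
    next_label = 0
    for i in range(n):
        if labels[i] < 0:
            same = np.linalg.norm(null - null[i], axis=1) < row_tol
            labels[same] = next_label
            next_label += 1
    return k, labels

# Toy graph: a path 0-1-2 and a separate edge 3-4.
adj = np.zeros((5, 5))
for u, v in [(0, 1), (1, 2), (3, 4)]:
    adj[u, v] = adj[v, u] = 1.0

k, labels = spectral_components(adj)
print(k, labels.tolist())  # prints: 2 [0, 0, 0, 1, 1]
```

Sorting the nodes by these labels is what turns each null-space eigenvector into a step-function-like curve, with one plateau per component.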
