Tractable algorithms for proximity search on large graphs

Identifying the nearest neighbors of a node in a graph is a key ingredient in a diverse set of ranking problems, e.g., friend suggestion in social networks, keyword search in databases, and web-spam detection. To find these "near" neighbors, we need graph-theoretic measures of similarity or proximity. Most popular graph-based similarity measures, e.g., the length of the shortest path or the number of common neighbors, look at the paths between two nodes in a graph. One such class of similarity measures arises from random walks. In the context of using these measures, we identify and address two important problems. First, we note that, while random-walk-based measures are useful, they are often hard to compute. Hence we focus on designing tractable algorithms for faster and better ranking using random-walk-based proximity measures in large graphs. Second, we theoretically justify why path-based similarity measures work so well in practice.

For the first problem, we focus on improving the quality and speed of nearest-neighbor search in real-world graphs. This work consists of three main components. First, we present an algorithmic framework for computing nearest neighbors in truncated hitting and commute times, which are proximity measures based on short-term random walks. Second, we improve upon this ranking by incorporating user feedback, which can counteract ambiguities in queries and data. Third, we address the problem of nearest-neighbor search when the underlying graph is too large to fit in main memory. We also prove a number of interesting theoretical properties of these measures, which have been key to designing most of the algorithms in this thesis.

We address the second problem by bringing together a well-known generative model for link formation and geometric intuitions. As a measure of the quality of ranking, we examine link prediction, which has been the primary tool for evaluating the algorithms in this thesis. Link prediction has been extensively studied in prior empirical surveys. Our work helps us better understand some common trends in the predictive performance of different measures seen across these empirical results.
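
To make the truncated hitting and commute times mentioned in the abstract concrete, the following is a minimal Python sketch. It assumes the usual definition: the T-truncated hitting time from i to j is the expected number of steps a random walk started at i takes to first reach j, where only walks of length at most T are considered, so the value never exceeds T. It also assumes an unweighted graph stored as an adjacency dictionary in which every node has at least one neighbor. The function names are illustrative and do not come from the thesis, and the brute-force dynamic program shown here is not the tractable algorithm the thesis develops; it only illustrates the quantity being ranked.

```python
def truncated_hitting_times(adj, target, T):
    """Return {node: T-truncated hitting time from node to `target`}.

    adj maps each node to a list of its neighbors (assumed non-empty).
    Recurrence: h^0(i) = 0; h^t(target) = 0;
                h^t(i) = 1 + average of h^(t-1) over the neighbors of i.
    """
    h = {u: 0.0 for u in adj}  # h^0 is zero everywhere
    for _ in range(T):
        nxt = {}
        for u, nbrs in adj.items():
            if u == target:
                nxt[u] = 0.0
            else:
                nxt[u] = 1.0 + sum(h[v] for v in nbrs) / len(nbrs)
        h = nxt
    return h


def rank_by_truncated_commute_time(adj, query, T):
    """Rank all other nodes by truncated commute time to `query`.

    Commute time is the sum of the hitting times in both directions;
    smaller values mean the node is "nearer" to the query.
    """
    to_query = truncated_hitting_times(adj, query, T)
    scores = {}
    for u in adj:
        if u != query:
            from_query = truncated_hitting_times(adj, u, T)[query]
            scores[u] = to_query[u] + from_query
    return sorted(scores, key=scores.get)


if __name__ == "__main__":
    # Hypothetical toy graph: a path a - b - c - d.
    path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
    print(truncated_hitting_times(path, "a", 2))
    print(rank_by_truncated_commute_time(path, "a", 3))
```

This sketch recomputes a full dynamic program for every candidate node, which is exactly the cost that becomes prohibitive on large graphs and motivates the search for tractable algorithms.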
