Computing the Entropy of User Navigation in the Web

Navigation through the web, colloquially known as "surfing", is one of the main activities of users during web interaction. Users following a navigation trail often become disoriented with respect to the goals of their original query, so the discovery of typical user trails could be useful in providing navigation assistance. Herein, we give a theoretical underpinning of user navigation in terms of the entropy of an underlying Markov chain modelling the web topology. We present a novel method for online incremental computation of the entropy, together with a large deviation result on the trail length needed to realize this entropy. We provide an error analysis for our entropy estimate in terms of the divergence between the empirical and actual probabilities. We then indicate applications of our algorithm in the area of web data mining. Finally, we extend our technique to higher-order Markov chains by a suitable reduction of a higher-order Markov chain model to a first-order one.
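To make the idea of online entropy computation concrete, the following is a minimal sketch, not the algorithm of the paper itself. It assumes the standard plug-in estimate of the entropy rate of a first-order Markov chain, H = -(1/n) * sum over (i, j) of n_ij * log2(n_ij / n_i), where n_ij is the number of observed transitions from page i to page j, n_i is the number of transitions leaving page i, and n is the total number of transitions. The class name OnlineEntropyEstimator and the helper to_first_order are hypothetical names introduced only for this illustration.

```python
import math
from collections import defaultdict


def _xlog2x(x):
    """Return x * log2(x), with the convention 0 * log2(0) = 0."""
    return x * math.log2(x) if x > 0 else 0.0


class OnlineEntropyEstimator:
    """Plug-in entropy-rate estimate (in bits), updated in O(1) per transition.

    Illustrative assumption: we maintain
        S = sum_ij f(n_ij) - sum_i f(n_i),  with f(x) = x * log2(x),
    so that the estimate is simply -S / n.
    """

    def __init__(self):
        self.pair_counts = defaultdict(int)   # n_ij
        self.state_counts = defaultdict(int)  # n_i (outgoing transitions)
        self.total = 0                        # n
        self._s = 0.0                         # running value of S

    def observe(self, src, dst):
        """Record one transition src -> dst and update S incrementally."""
        n_ij = self.pair_counts[(src, dst)]
        n_i = self.state_counts[src]
        # Remove the old contributions of the two counts that change ...
        self._s -= _xlog2x(n_ij) - _xlog2x(n_i)
        # ... and add the contributions of the incremented counts.
        self._s += _xlog2x(n_ij + 1) - _xlog2x(n_i + 1)
        self.pair_counts[(src, dst)] = n_ij + 1
        self.state_counts[src] = n_i + 1
        self.total += 1

    def observe_trail(self, trail):
        """Record every consecutive pair of pages in a navigation trail."""
        for src, dst in zip(trail, trail[1:]):
            self.observe(src, dst)

    def entropy(self):
        """Current entropy-rate estimate in bits per transition."""
        if self.total == 0:
            return 0.0
        return -self._s / self.total


def to_first_order(trail, k):
    """Recode a trail as overlapping k-tuples of pages.

    A k-th order chain over pages becomes a first-order chain over these
    tuples, so the estimator above can be reused unchanged.
    """
    return [tuple(trail[i:i + k]) for i in range(len(trail) - k + 1)]


if __name__ == "__main__":
    est = OnlineEntropyEstimator()
    est.observe_trail(["home", "news", "home", "sport"])
    print(est.entropy())
```

Feeding the output of to_first_order(trail, k) into observe_trail gives the corresponding estimate for an order-k model. The design choice here is to store only counts and one running sum, so each new page view costs constant time regardless of how many trails have been seen.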
