Personalization from incomplete data: what you don't know can hurt

Clickstream data collected at any single web site (site-centric data) is inherently incomplete, because it does not capture users' browsing behavior across sites (user-centric data). Models learned from such data may therefore be subject to limitations, the nature of which has not been well studied. Understanding these limitations is particularly important because most current personalization techniques are based on site-centric data only. In this paper, we empirically examine the implications of learning from incomplete data in the context of two specific problems: (a) predicting whether the remainder of a given session will result in a purchase and (b) predicting whether a given user will make a purchase in any future session. For each of these problems we present new algorithms for fast and accurate preprocessing of clickstream data. Based on a comprehensive experiment on user-level clickstream data covering the browsing behavior of 20,000 users, we demonstrate that models built on user-centric data outperform models built on site-centric data for both prediction tasks.
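To make the distinction between the two kinds of data concrete, the following minimal Python sketch derives a site-centric view from a small user-centric clickstream and computes the same session features on both. It is an illustration only, not the preprocessing algorithms presented in the paper: the event layout, the site names, and the helpers `site_centric` and `session_features` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical user-centric clickstream: one tuple per page view,
# (user, session, site, purchase_on_this_page). All values are made up.
events = [
    ("u1", "s1", "shopA.example", False),
    ("u1", "s1", "shopB.example", False),
    ("u1", "s1", "shopA.example", True),
    ("u2", "s2", "shopB.example", False),
    ("u2", "s2", "shopB.example", True),
]


def site_centric(events, site):
    """Keep only the page views that a single site could observe itself."""
    return [e for e in events if e[2] == site]


def session_features(events):
    """Count page views and distinct sites per session."""
    pages = defaultdict(int)
    sites = defaultdict(set)
    for _user, session, site, _purchased in events:
        pages[session] += 1
        sites[session].add(site)
    return {
        s: {"pages": pages[s], "distinct_sites": len(sites[s])}
        for s in pages
    }


# Features from the full (user-centric) data versus one site's own logs.
user_view = session_features(events)
site_view = session_features(site_centric(events, "shopA.example"))
```

In this toy data, the site-centric view of `shopA.example` sees only two of the three page views in session `s1`, cannot tell that the user also visited another site, and does not see session `s2` at all. Any model trained on `site_view` alone has to work without that information.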
