Discovery Science

The standard model for association-rule mining involves a set of items and a set of baskets. The baskets contain items that some customer has purchased at the same time. The problem is to find pairs, or perhaps larger sets, of items that frequently appear together in baskets. We mention the principal approaches to efficient, large-scale discovery of the frequent itemsets, including the a-priori algorithm, improvements using hashing, and one- and two-pass probabilistic algorithms for finding frequent itemsets. We then turn to techniques for finding highly correlated but infrequent pairs of items.

These notes were written for CS345 at Stanford University and are reprinted by permission of the author. http://www-db.stanford.edu/~ullman/mining/mining.html gives you access to the entire set of notes, including additional citations and on-line links.

Association Rules and Frequent Itemsets

The market-basket problem assumes we have some large number of items, e.g., "bread" or "milk." Customers fill their market baskets with some subset of the items, and we get to know what items people buy together, even if we don't know who they are. Marketers use this information to position items and control the way a typical customer traverses the store.

In addition to the marketing application, the same sort of question has the following uses:

- Baskets = documents; items = words. Words appearing frequently together in documents may represent phrases or linked concepts. One possible application is intelligence gathering.
- Baskets = sentences; items = documents. Two documents with many of the same sentences could represent plagiarism or mirror sites on the Web.

Goals for Market-Basket Mining

Association rules are statements of the form {X1, X2, ..., Xn} -> Y, meaning that if we find all of X1, X2, ..., Xn in the market basket, then we have a good chance of finding Y. The probability of finding Y given {X1, ..., Xn} is called the confidence of the rule. We normally would accept only rules that had confidence above a certain threshold. We may also ask that the confidence be significantly
higher than it would be if items were placed at random into baskets.

S. Arikawa and S. Morishita (Eds.): DS 2000, LNAI 1967, pp. 1-14, 2000.
(c) Springer-Verlag Berlin Heidelberg 2000

For example, we might find a rule like {milk, butter} -> bread simply because a lot of people buy bread. However, the "beer-diapers" story asserts that the rule {diapers} -> beer holds with confidence significantly greater than the fraction of baskets that contain beer.

Causality. Ideally, we would like to know that in an association rule the presence of X1, ..., Xn actually causes Y to be bought. However, "causality" is an elusive concept. Nevertheless, for market-basket data, the following test suggests what causality means. If we lower the price of diapers and raise the price of beer, we can lure diaper buyers, who are more likely to pick up beer while in the store, thus covering our losses on the diapers. That strategy works because "diapers causes beer." However, working it the other way round, running a sale on beer and raising the price of diapers, will not result in beer buyers buying diapers in any great numbers, and we lose money.

Frequent itemsets. In many (but not all) situations, we only care about association rules or causalities involving sets of items that appear frequently in baskets. For example, we cannot run a good marketing strategy involving items that almost no one buys anyway. Thus, much data mining starts with the assumption that we only care about sets of items with high support; i.e., they appear together in many baskets. We then find association rules or causalities only involving a high-support set of items; i.e., {X1, ..., Xn, Y} must appear in at least a certain percent of the baskets, called the support threshold.

Framework for Frequent Itemset Mining

We use the term frequent itemset for a set S that appears in at least fraction s of the baskets, where s is some chosen constant. We assume data is too large to fit in main memory. Either it is stored in a relational database, say as a relation Baskets(BID, item), or as a
flat file of records of the form (BID, item1, item2, ..., itemn).

When evaluating the running time of algorithms, we count the number of passes through the data. Since the principal cost is often the time it takes to read data from disk, the number of times we need to read each datum is often the best measure of the running time of the algorithm.

There is a key principle, called monotonicity or the a-priori trick, that helps us find frequent itemsets: if a set of items S is frequent, i.e., appears in at least fraction s of the baskets, then every subset of S is also frequent.

(Footnote: The famous, and possibly apocryphal, discovery that people who buy diapers are unusually likely to buy beer.)

Jeffrey D. Ullman
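The monotonicity principle translates directly into the candidate-pruning step of the a-priori algorithm: a set of size k need only be counted if all of its subsets of size k-1 are already known to be frequent. As a minimal sketch (not part of the original notes; in practice the data would be streamed from disk one pass per itemset size, rather than held in a Python list), this could look like:

```python
from itertools import combinations

def apriori(baskets, s):
    """Find all itemsets appearing in at least a fraction s of the baskets.

    Uses the a-priori (monotonicity) trick: a set can be frequent only if
    every one of its subsets is frequent, so size-k candidates are built
    only from frequent sets of size k-1.
    """
    n = len(baskets)
    minsup = s * n
    # Pass 1: count the single items.
    counts = {}
    for b in baskets:
        for item in b:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {i: c for i, c in counts.items() if c >= minsup}
    all_frequent = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Candidate generation: unions of frequent (k-1)-sets...
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # ...kept only if every (k-1)-subset is frequent (the pruning step).
        candidates = {c for c in candidates
                      if all(frozenset(sub) in prev
                             for sub in combinations(c, k - 1))}
        counts = {c: 0 for c in candidates}
        for b in baskets:  # one pass over the data per itemset size k
            bs = set(b)
            for c in candidates:
                if c <= bs:
                    counts[c] += 1
        frequent = {c: cnt for c, cnt in counts.items() if cnt >= minsup}
        all_frequent.update(frequent)
        k += 1
    return all_frequent
```

On a toy dataset with support threshold s = 0.4, the pruning means a candidate triple such as {milk, bread, beer} is never counted unless all three of its pairs survived the previous pass.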
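The confidence of a rule {X1, ..., Xn} -> Y, as defined above, is simply the fraction of baskets containing all of X1, ..., Xn that also contain Y. A hypothetical helper (the function name and the toy data are this sketch's own, not the notes') makes the definition concrete:

```python
def confidence(baskets, lhs, y):
    """Confidence of the rule lhs -> y: among the baskets that contain
    every item of lhs, the fraction that also contain y."""
    lhs = set(lhs)
    with_lhs = [b for b in baskets if lhs <= set(b)]
    if not with_lhs:
        return 0.0  # the rule's left-hand side never occurs
    return sum(1 for b in with_lhs if y in b) / len(with_lhs)
```

Comparing confidence({diapers} -> beer) against the overall fraction of baskets containing beer is exactly the "significantly higher than random" test described earlier: the rule is interesting only when the former clearly exceeds the latter.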