New probabilistic interest measures for association rules

Mining association rules is an important technique for discovering meaningful patterns in transaction databases. Many different measures of interestingness have been proposed for association rules. However, these measures fail to take the probabilistic properties of the mined data into account. We start this paper by presenting a simple probabilistic framework for transaction data, which can be used to simulate transaction data when no associations are present. We use such simulated data and a real-world database from a grocery outlet to explore the behavior of confidence and lift, two popular interest measures used for rule mining. The results show that confidence is systematically influenced by the frequency of the items in the left-hand side of rules and that lift performs poorly at filtering random noise in transaction data. Based on the probabilistic framework, we develop two new interest measures, hyper-lift and hyper-confidence, which can be used to filter or order mined association rules. The new measures show significantly better performance than lift for applications where spurious rules are problematic.
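To make the setting concrete, the following is a minimal sketch of simulating transaction data without associations and computing confidence and lift on it. It assumes that "no associations" can be modeled by including each item in a transaction independently with a fixed probability; the paper's actual framework may differ in detail. The function names (`simulate_independent`, `count`, `confidence`, `lift`) and the item probabilities are hypothetical and chosen only for illustration. The sketch uses the standard definitions of confidence and lift and does not implement hyper-lift or hyper-confidence.

```python
import random


def simulate_independent(n_transactions, item_probs, seed=0):
    """Simulate transactions in which items occur independently.

    Assumption (not taken from the paper): each item is included in a
    transaction with its own fixed probability, independently of all
    other items, so any association found is random noise.
    """
    rng = random.Random(seed)
    return [
        {item for item, p in item_probs.items() if rng.random() < p}
        for _ in range(n_transactions)
    ]


def count(transactions, itemset):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def confidence(transactions, lhs, rhs):
    """Standard confidence of the rule lhs -> rhs: count(lhs and rhs) / count(lhs)."""
    c_lhs = count(transactions, lhs)
    if c_lhs == 0:
        return float("nan")
    return count(transactions, set(lhs) | set(rhs)) / c_lhs


def lift(transactions, lhs, rhs):
    """Standard lift of the rule lhs -> rhs: observed over expected co-occurrence."""
    n = len(transactions)
    c_lhs = count(transactions, lhs)
    c_rhs = count(transactions, rhs)
    if c_lhs == 0 or c_rhs == 0:
        return float("nan")
    c_both = count(transactions, set(lhs) | set(rhs))
    return n * c_both / (c_lhs * c_rhs)


if __name__ == "__main__":
    # Hypothetical item probabilities: one frequent item, two rare ones.
    probs = {"milk": 0.3, "caviar": 0.002, "saffron": 0.002}
    data = simulate_independent(10_000, probs, seed=42)
    for lhs, rhs in [({"caviar"}, {"milk"}), ({"caviar"}, {"saffron"})]:
        print(sorted(lhs), "->", sorted(rhs),
              "confidence:", confidence(data, lhs, rhs),
              "lift:", lift(data, lhs, rhs))
```

Because the simulated items are independent by construction, any rule whose lift deviates from 1 on such data reflects sampling noise only. Rerunning the sketch with different seeds gives a rough sense of how much lift can fluctuate for rules built from rare items.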
