Special issue on best of SIGKDD 2011

This special issue includes five articles that are representative of the best work presented in the research track of the ACM SIGKDD 2011 conference. The annual ACM SIGKDD conference is the leading international forum for data mining researchers and practitioners from academia, industry, and government to share their research results, explore new ideas, and exchange experiences. Based on reviewer scores and recommendations, about a dozen papers were considered for the Best Paper Award (Research Track) by an award committee. This same committee then selected a subset of these papers, whose authors were invited to submit an extended version to be further reviewed for this special issue. This exercise has resulted in the five articles included in this issue, which cover a variety of topics in data mining, as briefly described below.

The first article, by S. Kaufman et al., entitled "Leakage in Data Mining: Formulation, Detection, and Avoidance", examines the problem of leakage, wherein a faulty construction of a dataset causes information about the target variable to creep into the data even though such leakage would not occur in an actual, real-life setting. Several prominent competitions have suffered from leakage problems. The authors propose ways of detecting leakage as well as avoiding it.

The second article, "Summarizing Data Succinctly with the Most Informative Itemsets" by M. Mampaey et al., presents an innovative, maximum-entropy-based approach to determine the itemsets that capture the essence of a dataset in a succinct manner. This approach has the potential to significantly increase the utility of association rules, as a naïve application of association rule mining often results in a large number of relatively uninformative rules.

"Triangle Listing in Massive Networks" by S. Chu and J. Cheng provides an I/O-efficient method for finding all triangles in a graph.
For very large graphs that do not fit in main memory, this method minimizes random disk access, leading to a scalable and memory-efficient solution.

"Multisource Domain Adaptation and Its Application to Early Detection of Fatigue" by R. Chattopadhyay et al. introduces an optimization-based framework for transfer learning that can adaptively incorporate information from patients who are somewhat similar (the multiple sources) while developing a predictive model to detect muscle fatigue in a new patient. Though presented in the context of a specific application, the methodology is quite general and has the potential to …