Introduction to the Special Issue ACM SIGKDD 2012

In this special issue of TKDD, we selected six of the best papers presented at the ACM SIGKDD 2012 conference. These invited papers went through the standard TKDD review process, and the accepted versions contain substantial new content beyond the original SIGKDD 2012 papers. We would like to thank all the authors who responded to our invitation, as well as all the reviewers who helped evaluate the papers.

The articles cover several emerging areas of data mining research and provide new insights into fundamental problems in the field. The first article deals with large time series data. The second article addresses clustering of interaction data. The next two articles deal with active learning, sampling, and entity matching. The last two articles are on the topics of multi-instance multi-label learning and ranking function learning.

The article “Addressing Big Data Time Series: Mining Trillions of Time Series Subsequences Under Dynamic Time Warping” by Rakthanmanon et al. demonstrates the counterintuitive fact that, in large datasets, exact search under Dynamic Time Warping can be performed much more quickly than with state-of-the-art Euclidean distance search algorithms. The authors demonstrate the performance of their method on the largest set of time series experiments ever attempted. In addition to mining massive datasets, this work has implications for real-time monitoring of data streams, allowing for much faster arrival rates and/or the use of cheaper, lower-powered devices than existing methods permit.

The article “PathSelClus: Integrating Meta-Path Selection with User-Guided Object Clustering in Heterogeneous Information Networks” by Sun et al. presents a new way to cluster objects in a large heterogeneous information network, where objects of multiple types are connected through paths with different semantics.
To cluster such objects according to the desired semantics, the authors propose using a “meta-path,” a path that connects object types via a sequence of relations, to control clustering with distinct semantics. A user provides a small set of seed objects for each cluster as guidance. The system then learns meta-path weights that are consistent with the clustering implied by this guidance and generates clusters under the learned weights.

The article “Active Sampling for Entity Matching with Guarantees” by Bellare et al. presents a new active learning approach to training a classifier to label pairs of entities as either duplicates or nonduplicates. Instead of minimizing the 0-1 loss of the classifier, a metric unsuitable for entity matching, this article proposes maximizing the recall of the classifier under the constraint that its precision exceed a specified threshold. The novelty of the method is its provably sublinear label complexity, whereas prior work has linear label complexity. The method therefore requires fewer label queries for pairs of entities.

The article “Batch Mode Active Sampling based on Marginal Probability Distribution Matching” by Chattopadhyay et al. presents a new batch-mode active learning method that selects a set of query samples so as to minimize the difference in distribution between the labeled and the unlabeled data. The authors formulate this objective as an NP-hard integer programming problem and provide two optimization techniques for solving it.

The article “Instance Annotation for Multi-instance Multi-Label Learning” by Briggs et al. considers a supervised classification scenario where the objects to be classified