Similarity Measure Selection for Clustering Time Series Databases

In the past few years, clustering has become a popular task in time series analysis. The choice of a suitable distance measure is crucial to the clustering process, but the selection is not straightforward: the literature offers a vast number of distance measures for time series, each with different characteristics. To simplify this task, we propose a multi-label classification framework that automatically selects the most suitable distance measures for clustering a given time series database. The classifier is based on a novel collection of characteristics that describe the main features of a time series database and provide the predictive information needed to discriminate between a set of distance measures. To test the validity of the classifier, we conduct a complete set of experiments using both synthetic and real time series databases and a set of five common distance measures. The positive results obtained by the classification framework across various performance measures indicate that the proposed methodology simplifies the process of distance selection in time series clustering tasks.
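
To illustrate why the choice of distance measure matters, the following minimal sketch compares two well-known measures on a pair of series that differ only by a one-step shift. The abstract does not name the five measures used in the experiments, so picking Euclidean distance and dynamic time warping (DTW) here is an assumption, and the function names and toy series are hypothetical.

```python
import math


def euclidean(a, b):
    """Lock-step Euclidean distance; both series must have the same length."""
    if len(a) != len(b):
        raise ValueError("series must have equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw(a, b):
    """Unconstrained dynamic time warping with squared point-wise cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j],      # repeat b[j-1]
                                 cost[i][j - 1],      # repeat a[i-1]
                                 cost[i - 1][j - 1])  # advance both
    return math.sqrt(cost[n][m])


# Toy series (hypothetical): the same bump, shifted by one time step.
base = [0, 0, 1, 2, 1, 0, 0, 0]
shifted = [0, 0, 0, 1, 2, 1, 0, 0]

print(euclidean(base, shifted))  # 2.0: lock-step comparison penalizes the shift
print(dtw(base, shifted))        # 0.0: warping absorbs the shift
```

A clustering based on Euclidean distance would treat these two series as different, while one based on DTW would treat them as identical. Which behavior is desirable depends on the database, which is the kind of decision the proposed framework aims to automate.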
