Clustering of Data Streams With Dynamic Gaussian Mixture Models: An IoT Application in Industrial Processes

In industrial Internet of Things applications, where sensors send dynamic process data at high speed, producing actionable insights at the right time is challenging. A key problem is processing a large amount of data while the underlying dynamic phenomena related to the machine may be evolving over time due to factors such as degradation. Any actionable model therefore becomes obsolete and must be updated. To cope with this problem, in this paper we propose a new unsupervised learning algorithm based on Gaussian mixture models, called Gaussian-based dynamic probabilistic clustering (GDPC). It is mainly based on integrating and adapting three well-known algorithms for use in dynamic scenarios: the expectation-maximization (EM) algorithm to estimate the model parameters, and the Page-Hinkley test and Chernoff bound to detect concept drifts. Unlike other unsupervised methods, the model induced by GDPC provides the membership probabilities of each instance to each cluster. This makes it possible to determine, through a Brier score analysis, the robustness of the instance assignment and its evolution each time a concept drift is detected. The algorithm also works with very little data and significantly less computing power, and it is able to decide whether (and when) to change the model. The algorithm is tested using synthetic data and data streams from an industrial testbed, where different operational states are automatically identified, giving good results in terms of classification accuracy, sensitivity, and specificity.
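
To make the drift-detection idea concrete, the following is a minimal sketch of a textbook Page-Hinkley test for an upward shift in the mean of a scalar stream. It is not the GDPC implementation: the abstract does not say which statistic GDPC monitors, so the input here is a generic sequence of numbers, and the class name `PageHinkley`, the helper `first_drift`, and the parameter values `delta` and `threshold` are hypothetical choices made only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class PageHinkley:
    """Textbook Page-Hinkley test for an increase in the stream mean.

    delta and threshold are illustrative defaults, not values from the paper.
    """
    delta: float = 0.1       # magnitude of change that is tolerated
    threshold: float = 10.0  # detection threshold (often written lambda)
    n: int = 0               # number of observations seen so far
    mean: float = 0.0        # running mean of the observations
    cum: float = 0.0         # cumulative deviation m_t
    cum_min: float = 0.0     # running minimum M_t of the cumulative deviation

    def update(self, x: float) -> bool:
        """Consume one observation; return True if a drift is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return self.cum - self.cum_min > self.threshold


def first_drift(stream: Iterable[float], **params: float) -> Optional[int]:
    """Return the index of the first signalled drift, or None if there is none."""
    detector = PageHinkley(**params)
    for i, x in enumerate(stream):
        if detector.update(x):
            return i
    return None


# Synthetic stream: the mean jumps from 0 to 5 at index 100.
shifted = [0.0] * 100 + [5.0] * 100
stable = [1.0] * 200
drift_index = first_drift(shifted, delta=0.1, threshold=10.0)
no_drift = first_drift(stable, delta=0.1, threshold=10.0)
```

In a streaming setting, a signal from such a detector would be the trigger to re-estimate the mixture model; a larger `threshold` trades slower detection for fewer false alarms.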
