Addressing data veracity in big data applications

Big data applications such as smart electric grids, transportation, and remote environment monitoring involve geographically dispersed sensors that periodically send information back to central nodes. In many cases, sensor data is not available at the central nodes at the frequency required for real-time modeling and decision-making. This may be due to physical limitations of the transmission networks, or to consumers limiting how often data is transmitted from sensors located at their premises because of security and privacy concerns. Such scenarios lead to a partial data problem and raise the issue of data veracity in big data applications. We describe a novel solution to the problem of making short-term predictions (up to a few hours ahead) in the absence of real-time data from sensors in the Smart Grid. A key implication of our work is that by using real-time data from only a small subset of influential sensors, we are able to make predictions for all sensors. We thus reduce the communication complexity involved in transmitting sensor data in Smart Grids. We use real-world electricity consumption data from smart meters to empirically demonstrate the usefulness of our method. Our dataset consists of data collected at 15-min intervals from 170 smart meters in the USC Microgrid for 7 years, totaling 41,697,600 data points.
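The stated dataset size can be checked with simple arithmetic. The short sketch below assumes 365-day years (leap days ignored) and no missing readings; both are assumptions made only for this check, and the variable names are illustrative.

```python
# Sanity check of the dataset size quoted in the abstract.
# Assumptions (not stated in the source): 365-day years, no gaps in the readings.
METERS = 170            # smart meters in the USC Microgrid
INTERVAL_MINUTES = 15   # sampling interval
YEARS = 7               # collection period
DAYS_PER_YEAR = 365     # assumption: leap days ignored

# Number of 15-minute readings per meter per day.
readings_per_day = 24 * 60 // INTERVAL_MINUTES

# Total readings across all meters over the whole period.
total_points = METERS * readings_per_day * DAYS_PER_YEAR * YEARS

print(f"{readings_per_day} readings per meter per day")
print(f"{total_points:,} data points in total")
```

Under these assumptions the product reproduces the 41,697,600 figure given in the abstract.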
