Scalable regression tree learning on Hadoop using OpenPlanet

As scientific and engineering domains attempt to analyze the deluge of data from sensors and instruments, machine learning is becoming a key data mining tool for building prediction models. The regression tree is a popular learning model that combines decision trees and linear regression to forecast numerical target variables from a set of input features. MapReduce is well suited to such data-intensive learning applications, and a proprietary regression tree algorithm, PLANET, was previously proposed for MapReduce. In this paper, we describe OpenPlanet, an open source implementation of this algorithm on the Hadoop framework that uses a hybrid approach. We evaluate the performance of OpenPlanet using real-world datasets from the Smart Power Grid domain for energy use forecasting, and propose strategies for tuning Hadoop parameters that improve performance over the default configuration by 75% for a training dataset of 17 million tuples on a 64-core Hadoop cluster on FutureGrid.
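To illustrate why regression tree learning maps naturally onto MapReduce, the following is a minimal single-process sketch, not OpenPlanet's actual code. It assumes the common squared-error split criterion for regression trees: each map task summarizes its data partition as per-split sufficient statistics (count, sum, sum of squares), and a reduce step merges them to pick the best threshold. All function names (`map_partition`, `reduce_stats`, `best_split`) and the toy data are hypothetical.

```python
from collections import defaultdict

def map_partition(rows, feature, thresholds):
    """Map step (hypothetical): summarize one partition as
    (threshold, side) -> [count, sum(y), sum(y^2)]."""
    stats = defaultdict(lambda: [0, 0.0, 0.0])
    for x, y in rows:
        for t in thresholds:
            side = "left" if x[feature] <= t else "right"
            s = stats[(t, side)]
            s[0] += 1
            s[1] += y
            s[2] += y * y
    return dict(stats)

def reduce_stats(partials):
    """Reduce step (hypothetical): add up the statistics from all partitions."""
    total = defaultdict(lambda: [0, 0.0, 0.0])
    for partial in partials:
        for key, (n, s, ss) in partial.items():
            acc = total[key]
            acc[0] += n
            acc[1] += s
            acc[2] += ss
    return dict(total)

def sse(n, s, ss):
    """Sum of squared errors around the mean, from sufficient statistics."""
    return ss - s * s / n if n else 0.0

def best_split(total, thresholds):
    """Pick the threshold minimizing left SSE + right SSE; skip empty sides."""
    best = None
    for t in thresholds:
        left = total.get((t, "left"), (0, 0.0, 0.0))
        right = total.get((t, "right"), (0, 0.0, 0.0))
        if left[0] == 0 or right[0] == 0:
            continue
        cost = sse(*left) + sse(*right)
        if best is None or cost < best[1]:
            best = (t, cost)
    return best

# Toy data: (features, energy use) split across two "partitions".
part1 = [({"temp": 1.0}, 10.0), ({"temp": 2.0}, 12.0)]
part2 = [({"temp": 8.0}, 30.0), ({"temp": 9.0}, 32.0)]
thresholds = [1.5, 5.0, 8.5]

partials = [map_partition(p, "temp", thresholds) for p in (part1, part2)]
print(best_split(reduce_stats(partials), thresholds))
```

Because only fixed-size statistics leave each partition, the data shuffled to the reducer scales with the number of candidate splits rather than with the number of training tuples, which is the property that makes this style of computation a good fit for MapReduce.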
