Performance Modeling of Big Data-Oriented Architectures

Big Data applications provide new, disruptive tools to advance our knowledge of the mechanisms that characterize complex aspects of reality. Be it a high-energy physics experiment or an analysis of social network data, the strength of the approach is the availability of a huge wealth of data. At the same time, this is also the main challenge: the abundance of information must be processed at a bearable cost per information unit, and it requires larger-scale systems to provide enough computing power. This is only possible if the Big Data platform is properly managed and exploited according to the needs of the applications, and a fundamental premise is the capability to properly evaluate the performance of the platform. In this chapter, we give an overview of the main aspects of performance evaluation for Big Data architectures, together with some examples of model-based evaluation, to show how large-scale architectures can be characterized to support their correct management. We also suggest a coarse-grained methodological solution that combines different conceptual and technical tools into a flexible, model-based approach to Big Data systems design supported by performance analysis, whose core evaluation stage can scale up easily by means of Markovian agents.
