A Cluster-Based Data-Centric Model for Network-Aware Task Scheduling in Distributed Systems

Big Data processing architectures are now widely recognized as one of the most significant innovations in computing of the last decade. Their enormous potential for collecting and processing huge volumes of data scattered throughout the Internet is opening the door to a new generation of fully distributed applications that, by leveraging the large amount of resources available on the network, will be able to cope with very complex problems and achieve unprecedented performance. However, the Internet is known to have severe scalability limitations when moving very large quantities of data. These limitations make it challenging to use the computing and storage resources available on the network efficiently, so that data-intensive applications can be executed effectively in such a complex distributed environment. This calls for resource scheduling decisions that drive the execution of tasks towards the data, taking network load and capacity into consideration in order to maximize data access performance and reduce queueing and processing delays as much as possible. Accordingly, this work presents a data-centric meta-scheduling scheme for fully distributed Big Data processing architectures. The scheme is based on clustering techniques whose goal is to aggregate tasks around storage repositories, and it is driven by a new concept of “gravitational” attraction between the tasks and their data of interest. It relies on heuristic criteria based on network awareness and advance resource reservation in order to suppress long delays in data transfer operations, resulting in an optimized use of data storage and runtime resources at the cost of a limited (polynomial) computational complexity.
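
The idea of aggregating tasks around storage repositories under a "gravitational" attraction can be illustrated with a minimal sketch. It is not the scheme proposed in this work: the attraction formula (data volume times free runtime slots, divided by the squared network transfer cost), the greedy placement order, and all names (`Repository`, `Task`, `attraction`, `cluster_tasks`) are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Repository:
    name: str
    free_slots: int        # hypothetical: runtime slots co-located with the storage
    transfer_cost: float   # hypothetical network "distance", e.g. seconds per GB under current load


@dataclass
class Task:
    name: str
    data_gb: dict          # repository name -> GB of this task's input held there


def attraction(task, repo, free_slots):
    """Assumed gravity-like score: (data mass * resource mass) / distance^2."""
    data_mass = task.data_gb.get(repo.name, 0.0)
    return data_mass * free_slots / (repo.transfer_cost ** 2)


def cluster_tasks(tasks, repos):
    """Greedily attach each task to the repository that attracts it most."""
    slots = {r.name: r.free_slots for r in repos}
    clusters = {r.name: [] for r in repos}
    unplaced = []
    # Place the most data-hungry tasks first (an illustrative heuristic).
    for task in sorted(tasks, key=lambda t: -sum(t.data_gb.values())):
        candidates = [r for r in repos if slots[r.name] > 0]
        if not candidates:
            unplaced.append(task.name)
            continue
        best = max(candidates, key=lambda r: attraction(task, r, slots[r.name]))
        clusters[best.name].append(task.name)
        slots[best.name] -= 1   # reserve the slot in advance
    return clusters, unplaced


repos = [Repository("A", free_slots=1, transfer_cost=1.0),
         Repository("B", free_slots=2, transfer_cost=2.0)]
tasks = [Task("t1", {"A": 10.0}),
         Task("t2", {"A": 8.0, "B": 2.0}),
         Task("t3", {"B": 5.0})]
clusters, unplaced = cluster_tasks(tasks, repos)
```

In this toy run, `t1` takes the only slot at repository `A`, so `t2` falls back to `B` even though most of its data lives at `A`. This is the kind of trade-off between data locality and resource availability that a network-aware scheduler has to weigh. The loop is polynomial in the number of tasks and repositories.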
