On the Optimization of gLite-Based Job Submission

A Grid is a highly dynamic, complex, and heterogeneous system whose reliability can be adversely affected by several factors, such as communication and hardware faults, middleware bugs, or misconfigurations due to human error. As the infrastructure scales to span a large number of sites, each hosting hundreds or thousands of hosts and resources, runtime faults following job submission become a very frequent phenomenon. Fault avoidance therefore becomes a fundamental aim in modern Grids, since the dependability of individual resources, spread across widely distributed computing infrastructures and often used outside their native organizational boundaries, cannot be guaranteed in any systematic way. Accordingly, we propose a simple job optimization solution based on a user-driven fault avoidance strategy. The strategy starts by introducing into the grid information system several on-line service-monitoring metrics, which the workload management system can use as hints to drive resource discovery according to a fault-free resource-scheduling plan. This solution, whose main goal is to minimize execution time by avoiding execution failures, proved very effective in improving both the user-perceived quality and the overall grid performance.
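The fault avoidance idea can be illustrated with a minimal sketch, which is not the implementation described in the paper. It assumes that each candidate resource carries a few monitoring-derived values and that resource discovery first discards unhealthy resources and then ranks the rest. The class `ComputingElement`, its fields, the function `select_resources`, the host names, and the 5% failure threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputingElement:
    """Hypothetical view of a resource as seen through monitoring metrics."""
    name: str
    service_ok: bool            # assumed: latest service-monitoring probe passed
    recent_failure_rate: float  # assumed: fraction of recent jobs that failed
    estimated_wait_s: float     # assumed: estimated queue wait, in seconds


def select_resources(candidates, max_failure_rate=0.05):
    """Drop resources flagged as faulty, then rank the rest by expected wait."""
    healthy = [
        ce for ce in candidates
        if ce.service_ok and ce.recent_failure_rate <= max_failure_rate
    ]
    return sorted(healthy, key=lambda ce: ce.estimated_wait_s)


# Illustrative data only; these are not measurements from the paper.
candidates = [
    ComputingElement("ce-a.example.org", True, 0.01, 300.0),
    ComputingElement("ce-b.example.org", False, 0.00, 10.0),
    ComputingElement("ce-c.example.org", True, 0.20, 5.0),
    ComputingElement("ce-d.example.org", True, 0.00, 120.0),
]

ranked = select_resources(candidates)
for ce in ranked:
    print(ce.name, ce.estimated_wait_s)
```

In this sketch the filter runs before the ranking, so a resource with a short queue but a failing probe or a high recent failure rate is never selected. This mirrors the stated goal of reducing execution time by avoiding failures rather than recovering from them.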
