A Fault Avoidance Strategy Improving the Reliability of the EGI Production Grid Infrastructure

Reliability is a crucial issue in the development of stable and effective production grid infrastructures: grid users must be able to rely on the runtime services they request and receive from the underlying grid. Many runtime services and capabilities offered by modern grid infrastructures are not known in advance to application developers and are dynamically bound only at execution time, which increases the incidence of interaction faults. In this work we propose, implement and evaluate a novel low-impact fault-avoidance scheme, specifically conceived to improve grid reliability from the user/application point of view by providing proper service status information to the workload management system. In particular, starting from the EGEE experience, we designed a strategy that inhibits the use of specific runtime capabilities on the available resources as soon as the monitoring system detects any anomalous behavior associated with these capabilities, and re-integrates them once they resume working correctly. The results of a significant set of tests run on the production EGEE infrastructure are presented to show the effectiveness of our approach.
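As a rough illustration of the inhibit/re-integrate idea described above, the following Python sketch keeps a per-resource record of withheld capabilities and filters resources during matchmaking. It is not the scheme implemented in the paper: the class `CapabilityRegistry`, its methods, and the resource and capability names (`ce-a`, `ce-b`, `mpi`, `data-access`) are all hypothetical, and real monitoring results would come from the grid monitoring system rather than from direct calls.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class CapabilityRegistry:
    """Hypothetical status store consulted by a workload manager."""

    # Resource name -> capabilities that the resource advertises.
    advertised: Dict[str, Set[str]]
    # Resource name -> capabilities currently withheld from matchmaking.
    inhibited: Dict[str, Set[str]] = field(default_factory=dict)

    def report(self, resource: str, capability: str, healthy: bool) -> None:
        """Apply one monitoring result for a (resource, capability) pair."""
        withheld = self.inhibited.setdefault(resource, set())
        if healthy:
            # The capability works again: re-integrate it.
            withheld.discard(capability)
        else:
            # Anomalous behavior detected: inhibit the capability.
            withheld.add(capability)

    def usable(self, resource: str) -> Set[str]:
        """Capabilities of a resource that are advertised and not inhibited."""
        advertised = self.advertised.get(resource, set())
        return advertised - self.inhibited.get(resource, set())

    def eligible(self, required: Set[str]) -> List[str]:
        """Resources whose usable capabilities cover every required one."""
        return sorted(r for r in self.advertised if required <= self.usable(r))


registry = CapabilityRegistry({"ce-a": {"mpi", "data-access"}, "ce-b": {"mpi"}})

# A failed probe removes only the faulty capability, not the whole resource.
registry.report("ce-a", "mpi", healthy=False)
after_failure = registry.eligible({"mpi"})
still_usable = registry.eligible({"data-access"})

# A later successful probe restores the capability.
registry.report("ce-a", "mpi", healthy=True)
after_recovery = registry.eligible({"mpi"})
```

In this sketch a resource with one faulty capability stays eligible for jobs that do not need it, which is one way to read the "low-impact" aim stated above.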
