Reliability Analysis in Distributed Systems

The reliability of a distributed processing system is an important design parameter. It can be described in terms of the reliability of the processing elements and communication links, together with the redundancy of programs and data files. The traditional terminal-pair reliability measure does not capture the redundancy of programs and files in a distributed system. Two reliability measures are introduced: distributed program reliability, which is the probability of successful execution of a program that requires the cooperation of several computers, and distributed system reliability, which is the probability that all the specified distributed programs for the system are operational. Both measures can be extended to incorporate the effects of user sites on reliability. An efficient approach based on graph traversal is developed to evaluate the proposed reliability measures.
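
To make the definition of distributed program reliability concrete, the following is a minimal sketch, not the graph-traversal algorithm developed in the paper. It assumes perfectly reliable processing elements, independent link failures, and a program that succeeds when at least one site hosting it can reach a copy of every file it needs. It enumerates all link states by brute force, so it is only practical for very small networks. The function names, the data layout, and the three-node example are hypothetical.

```python
from itertools import product


def reachable(start, adjacency):
    """Return the set of nodes reachable from start over working links."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen


def distributed_program_reliability(links, program_sites, file_sites):
    """Probability that some program site can reach a copy of every file.

    links:         {(u, v): probability that the link u-v is working}
    program_sites: nodes holding a copy of the program
    file_sites:    {file name: set of nodes holding a copy of that file}
    Assumes perfect nodes and independent link failures (illustrative only).
    """
    edges = list(links)
    total = 0.0
    # Enumerate every up/down combination of the links.
    for state in product((True, False), repeat=len(edges)):
        probability = 1.0
        adjacency = {}
        for (u, v), is_up in zip(edges, state):
            p = links[(u, v)]
            probability *= p if is_up else 1.0 - p
            if is_up:
                adjacency.setdefault(u, set()).add(v)
                adjacency.setdefault(v, set()).add(u)
        # The program runs if one hosting site reaches all required files.
        for site in program_sites:
            nodes = reachable(site, adjacency)
            if all(nodes & holders for holders in file_sites.values()):
                total += probability
                break
    return total


# Hypothetical example: the program runs at A, F1 is stored at A,
# and F2 is replicated at B and C.
links = {("A", "B"): 0.9, ("A", "C"): 0.8}
reliability = distributed_program_reliability(
    links,
    program_sites={"A"},
    file_sites={"F1": {"A"}, "F2": {"B", "C"}},
)
print(round(reliability, 4))
```

In this example the program fails only when both links are down, so the result is about 1 - 0.1 × 0.2 = 0.98. Terminal-pair reliability between A and B alone would give 0.9, which illustrates how file redundancy changes the measure.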
