Felix C. Gärtner: Fundamentals of Fault Tolerant Distributed Computing in Asynchronous Environments. Technical Report TUD-BS-1998-02, Department of Computer Science, Darmstadt University of Technology, Darmstadt, Germany, July 1998.


Abstract

Fault tolerance in distributed computing is a wide area with a significant body of literature that is vastly diverse in methodology and terminology. This paper aims to structure the area and thereby guide readers into this interesting field of computer science. It uses a formal approach to define important terms such as fault, fault tolerance, and redundancy. These definitions lead to four distinct forms of fault tolerance and to the two main phases in achieving them, called detection and correction. The paper shows that this approach can help reveal fundamental structures inherent in the field that contribute to the understanding and unification of methods and terminology. In the process, many existing methodologies in this area of computer science are surveyed and their relations are discussed. The underlying system model is the close-to-reality asynchronous message-passing model of distributed computations.
Available as a PostScript file.


Felix Gärtner (felix@informatik.tu-darmstadt.de)