Felix C. Gärtner: Fundamentals of Fault Tolerant
Distributed Computing in Asynchronous Environments.
Technical Report TUD-BS-1998-02, Department of Computer Science,
Darmstadt University of Technology, Darmstadt, Germany, July 1998.
Abstract
Fault tolerance in distributed computing is a wide area with a
significant body of literature that is vastly diverse in methodology
and terminology. This paper aims at structuring the area and thus
guiding readers into this interesting field of computer science. The
paper uses a formal approach to define important terms like fault,
fault tolerance and redundancy. This leads to
four distinct forms of fault tolerance and to the two main phases in
achieving them, called detection and correction. It is
shown that this approach can help reveal fundamental structures inherent in
the field that contribute to the understanding and unification of
methods and terminology. In the process, many existing methodologies
of this area of computer science are surveyed and their relations
are discussed. The underlying system model is the close-to-reality
asynchronous message-passing model of distributed computations.
Available as a PostScript file.
Felix Gärtner (felix@informatik.tu-darmstadt.de)