August 13, 2003
(last updated September 11, 2003)
The workshop on Principles of Dependable Systems (PoDSy) was held on June 24, 2003 in conjunction with the International Conference on Dependable Systems and Networks (DSN 2003) at the Cathedral Hill Hotel in San Francisco, California. The event attracted around 80 participants and was praised by attendees and the DSN organizers for its highly relevant choice of topics, quality reviewing, and excellent opportunities to learn about and help shape new ideas. This document briefly recalls the motivation and background of PoDSy, then reports on the workshop itself, and finally concludes with some preliminary findings and results from the workshop.
The aim of PoDSy was to bring together researchers and practitioners from both the fault-tolerance and security communities to discuss foundational topics (and related applied experiences) on the similarities and differences between both areas. The workshop was jointly organized by Felix Gärtner (EPFL, Switzerland), Klaus Kursawe (IBM Zurich Research Laboratory, Switzerland), and Levente Buttyan (Budapest University of Technology and Economics, Hungary). Felix and Klaus were both present at the workshop.
Starting with the similarities, Paolo reported that the sets of desired properties and their impairments are very similar in fault tolerance and security. Properties can be specified as a set of safety and liveness properties, while faults and attacks can be regarded as actions that aim to violate these properties. Hence, models and techniques from fault-tolerance can also be applied to achieve security. One of the main thrusts emerging from this perspective is the paradigm shift in security away from pure attack prevention towards a form of attack tolerance, a point also made in Cathy Meadows' talk.
A couple of points were put into similar perspective during the talk.
The notion of an attack was contrasted with a fault using the AVI model: in this model, a fault is equivalent to an intrusion, and the sources of an intrusion are a vulnerability plus an attack (Attack + Vulnerability = Intrusion, AVI). In contrast to faults, it is hard to estimate the coverage of ``attacker assumptions'' (this was called the coverage problem). Paolo proposed to regard the level of trustworthiness as a measure for placing reliance on secure systems: in secure systems, component A often has to trust component B to achieve security. The level of trustworthiness is a measure similar to assumption coverage in fault tolerance: it is the probability that the trust placed in a component is justified. Trustworthiness can be increased through cryptographic protocols, tamper-proof hardware, etc. As in fault tolerance, the coverage is never 1 for real systems.
Fig. 1: Paolo Verissimo during his DSN PoDSy talk.
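The probabilistic reading of trustworthiness above can be made concrete with a toy sketch (not from the talk; the numbers and the independence assumption are invented for illustration): if trust assumptions were statistically independent, the coverage of a chain of trusted components would be the product of the individual coverages, which is never 1 and shrinks as the chain grows.

```python
# Toy illustration: trustworthiness as a coverage probability.
# Assumes (unrealistically) independent trust assumptions; all
# component coverage values below are invented.

def chain_coverage(coverages):
    """Probability that every trust assumption in the chain holds."""
    result = 1.0
    for c in coverages:
        result *= c
    return result

# Even highly trustworthy components compound to a lower system figure:
system = chain_coverage([0.99, 0.95, 0.98])
print(system)  # a value below each individual coverage
```

The point of the sketch is only the compounding effect; whether any statistical independence between trust assumptions is justified is exactly the coverage problem the talk raised.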
Paolo finally spoke about the issues of fault and timing assumptions in secure/intrusion-tolerant systems. He described two extreme perspectives: On the one hand, people tend to make worst case assumptions about everything, e.g., take the Byzantine/arbitrary failure model and make no timing assumptions at all (asynchronous system assumption). This is the path often followed by security/cryptography people. However, worst case assumptions usually conflict with known impossibility results or lead to very expensive protocols that run very inefficiently in most practical situations. On the other hand, people coming from the area of fault-tolerance often make restricted fault assumptions about malicious components and assume at least some weak timeliness guarantees (e.g., partial synchrony). In this stream, it is hard to come up with "good" behavior restrictions for compromised components, and timing guarantees are usually the first to be violated in the presence of attacks (the coverage problem again).
Paolo argued for a notion he called architectural hybridization: Hybrid fault models, i.e., combinations of arbitrary and more restricted faults, deserve more consideration. The hybrid fault assumption can be enforced by the system architecture, i.e., by providing components with different levels of trustworthiness (strong and weak). Strong components, e.g., tamper-proof processing environments, can be employed strategically within the system where necessary. Similarly, regarding timeliness assumptions, he advocated the idea of wormholes, timely and secure computing channels, and presented architectural designs to build them. The design principle is to have the system execute the security-critical operations with high confidence in a small, highly secured subsystem. A few case studies from the MAFTIA project applying these principles were presented.
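As a purely illustrative sketch of this design principle (class names and the attestation operation are invented here, not MAFTIA's actual design): only a small security-critical step is delegated to a trusted "wormhole" subsystem, while the bulk of the work stays in the large, possibly compromised part.

```python
# Illustrative sketch of architectural hybridization; all names and
# operations are hypothetical, not taken from MAFTIA.

class TrustedSubsystem:
    """Stand-in for a small, tamper-proof, timely 'wormhole'."""
    def __init__(self):
        self._counter = 0
    def attest(self, digest: bytes):
        # The trusted part vouches only for a small digest, stamping it
        # with a monotonic counter that a compromised host cannot roll back.
        self._counter += 1
        return (digest, self._counter)

class UntrustedHost:
    """Large untrusted part: does the heavy work, then asks the trusted
    subsystem to vouch for a short summary of it."""
    def __init__(self, wormhole: TrustedSubsystem):
        self.wormhole = wormhole
    def process(self, payload: bytes):
        digest = bytes([sum(payload) % 256])  # toy digest of the bulk work
        return self.wormhole.attest(digest)

host = UntrustedHost(TrustedSubsystem())
d1 = host.process(b"large request one")
d2 = host.process(b"large request two")
assert d2[1] > d1[1]  # attestations are ordered by the trusted counter
```

The design choice mirrored here is the one from the talk: keep the trusted component small and its interface narrow, so that strong assumptions need to hold only for it.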
Talk slides are available at http://lpdwww.epfl.ch/fgaertner/podsy2003/verissimo-podsy.pdf
One obvious approach to dealing with these difficulties is to ``ignore'', or better: abstract out, the cryptography. This is done in many approaches to cryptographic protocol design, for example in the well-known work of Dolev and Yao and the Spi Calculus of Abadi and Gordon. This approach simplifies the matter to such an extent that automatic analysis by tools becomes feasible. However, the assumption underlying this approach is that no new insecurities are introduced when the abstract cryptography is implemented with ``real'' cryptography.
The other approach is what Ran called the ``traditional cryptographic approach''. In this approach, basically nothing is abstracted away; the whole protocol is the focus of analysis, down to the number-theoretic properties of the encryption scheme used. Obviously, analysis in this approach achieves a higher level of confidence in the security of the system. But the proofs are not easily automatable and usually remain a hard, error-prone undertaking.
Ideally, it would be nice to combine both approaches in a modular way: first analyze the system using established tools with abstracted ``black box'' cryptographic primitives, then analyze the cryptographic primitives separately using the traditional approach, and finally combine them into an overall secure system. The only thing needed in this approach is a theorem which states: (1) if the abstract system is secure with black box cryptography, and (2) if the implementation of the cryptography is secure, then (3) the combination of both is also secure. Engineers can then analyze the building blocks separately and later put them together, relieved of both complex monolithic proofs and oversimplified results.
Ran presented the foundations for a theory that gives us such composition theorems. The definition of security underlying this theory is that of the area of secure multiparty computation: a secure system is defined using some 100% trusted service (such as a 100% trusted and available trusted third party). A real system is defined to be secure if it can ``emulate'' the ideal system. Emulation is defined probabilistically, i.e., both systems are indistinguishable with very high probability. His composition theorem even allows a more modular design of the security protocols themselves. This idea has been applied to many existing protocols such as digital signatures, key exchange, and secure communication. He noted that some of the case studies were performed by Michael Waidner's group at IBM Zurich, where a similar approach is pursued. What is still lacking are tools which can automate this approach, or at least support it to make parts more manageable.
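The engineering idea behind such composition theorems can be sketched in code (this is only a shape sketch, not Canetti's formal framework; all class and function names are hypothetical, and the "real" cipher is a trivial stand-in): the protocol is written once against an abstract ideal primitive, and a concrete implementation that emulates it is substituted later.

```python
# Shape sketch of modular protocol analysis; not a real framework.

class IdealChannel:
    """Black-box secure channel: the abstraction a tool would analyze."""
    def send(self, msg):
        return ("opaque", msg)   # idealized: the adversary learns nothing
    def recv(self, pkt):
        return pkt[1]

class RealChannel:
    """Concrete stand-in (a toy XOR 'cipher', purely to show the shape)."""
    def __init__(self, key: int):
        self.key = key
    def send(self, msg: bytes):
        return ("ct", bytes(b ^ self.key for b in msg))
    def recv(self, pkt):
        return bytes(b ^ self.key for b in pkt[1])

def echo_protocol(channel, msg):
    """Protocol written once against the channel interface."""
    return channel.recv(channel.send(msg))

# The composition theorem's promise: if echo_protocol is secure with
# IdealChannel, and RealChannel securely emulates IdealChannel, then
# the combination is secure as well.
assert echo_protocol(IdealChannel(), b"hi") == b"hi"
assert echo_protocol(RealChannel(7), b"hi") == b"hi"
```

The division of labor matches the text: tools analyze `echo_protocol` against the ideal primitive, cryptographers analyze the emulation claim, and the theorem glues the two results together.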
Talk slides are available at http://lpdwww.epfl.ch/fgaertner/podsy2003/canetti-podsy2003.pdf
Fig. 2: Ran Canetti at DSN PoDSy.
The first point Cathy made concerned the worst case assumptions usually employed in security. Worst case assumptions give strong theorems but often yield solutions which are impractical to use in reality. These limitations are well-known today, and Cathy takes this as a positive sign of a maturing research field. As an example, she presented the problem of information flow analysis in multi-level security systems. While in earlier work, theories and techniques were established to build systems which were invulnerable to this type of problem (the work on non-interference), today people are looking into measuring the amount of information flowing over a covert channel when the worst case assumptions are (realistically) relaxed.
The second point was that security research is broadening its scope. While in earlier days the focus was very much on secrecy, implemented by ``perfect'' access control mechanisms after eliminating all covert channels, the access control model was not very helpful when people started looking at other security properties like integrity or availability: to maintain integrity, an access control mechanism must not only control who writes data but also what is written; to prevent denial of service (and maintain availability), an access control mechanism must first solve the problem of identifying who is requesting the service.
Fig. 3: Catherine Meadows revisits a 1995 paper at DSN PoDSy.
Today, we are seeing a more flexible use of attack (=fault) assumptions in security, depending on what situation we are trying to defend against. There can be faults in the security mechanism, the usual hostile attacks on the system, but also misuse made possible by insecure passwords, incorrect configuration, users opening email attachments from unknown sources, etc. Today, security failures typically result from a combination of multiple system faults. A refined fault model makes it easier to identify these threats. Security has also profited from the dependability view by no longer relying on a single security mechanism. For example, the use of honeypots to distract intruders basically acknowledges that the first defense perimeter can fail, but provides backup mechanisms to observe and identify the attacker.
According to Cathy, the analog of fault forecasting, i.e., measuring security, is still very much an open problem. Also, the effects on reliability of including security in a fault-tolerant system are not very clear. Analyzing security in a dynamic environment with attackers that change their abilities over time is also an area worth studying more.
Finally, Cathy returned to the 21st century and reviewed two new security paradigms which have evolved since: intrusion tolerance and survivability, both of which had an independent talk at PoDSy (see Paolo Verissimo's talk for intrusion tolerance and John Knight's talk for survivability). She closed by stating that fault forecasting is the only item from the dependability taxonomy which has not been added to the security toolbox, a problem which seems to be much harder in security than in dependability.
Talk slides are available at http://lpdwww.epfl.ch/fgaertner/podsy2003/meadows-podsy2003.pdf
Fig 4: John Knight during the opening of his DSN PoDSy talk.
Informally, survivability can be thought of as the property of maintaining a certain system objective in the presence of attacks or other traumas, including physical damage to equipment and random failures. This system objective need not be identical to the original system objective (like maintaining full communication facilities between all commanders on a battlefield); the system objective may change according to the severity or type of the trauma and the prevailing conditions. If the original system behavior is to calculate a function f(), then in the presence of certain traumas it might be required to compute a function g() which might be totally different from f(). This makes survivability different from masking fault-tolerance, where f() is unchanged, and also from graceful degradation, where g() must be some definable subset of f(). As an example, depending on the trauma, a command system might wish to maintain only communication between battlefield commanders and the commander-in-chief, no longer between the commanders themselves. In a different scenario (i.e., with a different trauma), only local communication within geographical regions may be guaranteed.
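The f()/g() distinction can be sketched in code (a toy illustration; the trauma levels and service functions below are invented, loosely following the battlefield example): the system objective is itself selected as a function of the trauma, rather than being a fixed function or a mechanically degraded version of it.

```python
# Toy sketch of survivability: the objective changes with the trauma.
# Node names, trauma levels, and services are invented for illustration.

def full_service(nodes):
    """f(): every commander can reach every other commander."""
    return {(a, b) for a in nodes for b in nodes if a != b}

def hq_only_service(nodes, hq="HQ"):
    """g(): only links between each commander and the commander-in-chief."""
    return {(n, hq) for n in nodes if n != hq} | \
           {(hq, n) for n in nodes if n != hq}

def surviving_objective(trauma, nodes):
    """Select the system objective according to the trauma."""
    if trauma == "none":
        return full_service(nodes)
    return hq_only_service(nodes)

nodes = ["HQ", "A", "B"]
assert ("A", "B") in surviving_objective("none", nodes)
assert ("A", "B") not in surviving_objective("severe", nodes)
assert ("A", "HQ") in surviving_objective("severe", nodes)
```

The sketch shows only the selection structure; in a real survivable system the hard part is, of course, deciding and enforcing which objective applies under which trauma.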
From its definition, survivability is a very adaptable notion and can be applied to many different settings in practice. It can be applied in circumstances where full masking is infeasible or impossible, like nation-wide command and control systems or in safety critical systems, for which John gave examples.
Talk slides are available at http://lpdwww.epfl.ch/fgaertner/podsy2003/knight-podsy2003.pdf
The first paper was entitled ``Whisper: A Local Secret Maintenance Protocol'' and written by Vinayak Naik, Anish Arora, Sandip Bapat (all from The Ohio State University, USA), and Mohamed Gouda (of The University of Texas at Austin, USA). Vinayak gave the talk and explained the workings of the Whisper protocol, a protocol that provably maintains security and fault-tolerance at the same time. The protocol maintains a shared secret between two nodes which is refreshed in an incremental fashion. The main security properties are forward and backward secrecy, meaning that the protocol maintains or recovers the shared secret even if parts of the secret are learned by an attacker. The fault-tolerance properties involve stabilizing behavior if arbitrary (non-critical) variables are perturbed by faults. The protocol is aimed at the emerging field of sensor networks, networks of very small and computationally bounded devices. Therefore, Whisper uses only computationally inexpensive cryptographic operations (hashes, one-way functions) to achieve security. This is perhaps the most distinguishing feature of the protocol.
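The incremental, hash-based style of refresh can be illustrated with a minimal hash-chain sketch (a hypothetical illustration, not the actual Whisper protocol): each refresh applies a cheap one-way function, so a key captured at some step reveals nothing about earlier keys, and both nodes stay in sync by applying the same steps.

```python
# Hypothetical hash-chain sketch of incremental secret refresh.
# This is NOT the Whisper protocol, only the one-way-function idea
# behind this style of cheap secret maintenance.

import hashlib

def refresh(secret: bytes) -> bytes:
    """One inexpensive, one-way refresh step (suitable for tiny devices)."""
    return hashlib.sha256(secret).digest()

key = b"shared-initial-secret"   # invented placeholder secret
history = [key]
for _ in range(3):
    key = refresh(key)
    history.append(key)

# Inverting SHA-256 is infeasible, so capturing a later key does not
# reveal earlier ones; two nodes applying the same steps stay in sync.
assert len(set(history)) == 4           # every refresh yields a new key
assert history[1] == refresh(history[0])
```

Whisper's actual contribution, per the talk, is stronger: it also recovers the shared secret if parts of it leak and stabilizes after perturbation of non-critical variables, which a bare hash chain does not provide.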
In the second talk, the focus was more on fault-tolerance. The paper was entitled ``Composing Distributed Fault-Tolerance Components'' and was authored by Sandeep Kulkarni, Karun Biyani, and Umamaheswaran Arumugam (all from Michigan State University, USA). Sandeep gave the talk. He described an approach to dynamically replace fault-tolerance components in a distributed system even if different components depend on each other. Replacement is done on-line, i.e., the approach sometimes must wait for certain components to cease interacting with other components (a method of ``freezing on demand'' was presented). The interesting feature of this approach is that it can be used to maintain arbitrary safety properties in the system. This makes it applicable not only to fault-tolerance but also to security, whenever the security property can be formalized as a safety property.
In the final talk, Dominic Duggan presented a paper entitled ``An Architecture for Secure Fault-Tolerant Global Applications'', co-authored by Tom Chothia (both from Stevens Institute of Technology, USA). The approach falls into the class of language-based fault-tolerance and security, meaning that certain basic concepts of dependable computing (like transaction consistency despite failures or secrecy of communication despite eavesdroppers) should be made accessible to the application programmer at different levels using the right abstractions. For example, the high-level design view should not have to deal explicitly with encrypted communication or transaction consistency, while a low-level ``untyped'' implementation view (into which high-level designs are compiled) has direct access to different types of encryption primitives. The aim of this research is ``global computing'', the view that trusted computers in a distributed system like the Internet can share resources and transparently simulate a global computer performing the necessary application tasks. Dominic gave several programming language examples and showed the rules, based on process calculi, for proving refinement correctness.
Fig. 5: Snapshot of the panelists at DSN PoDSy 2003
looking at Yves Deswarte's presentation slides.
The panel began with a round of statements regarding answers to the panel question by all panelists. The final hour of the panel was spent discussing various aspects, including some provocative questions thrown in from the back of the room by Tom Anderson and Brian Randell.
The discussions at the panel including the initial statements were taped,
and a transcription will be made available as an EPFL Technical Report.
The coverage problem. In fault-tolerance, hardware faults are usually modeled as random, statistically independent events. This way of modeling has its roots in a long tradition of reliability engineering and can be backed by empirical measurements. Statistical independence allows one to calculate coverage metrics for these types of faults, yielding a probability that a system will deliver its service in practice. This methodology works no matter how the faults manifest themselves, i.e., they can be simple stuck-at faults or very pessimistic arbitrary (or Byzantine) faults within a given fault region. There is no such thing in security. The problem in security is that it is hard to predict how attackers will operate, even if the vulnerabilities of a certain system are known. It is an open question whether it is justifiable to assume any form of statistical (in)dependence between attacks or to associate any kind of probability with them. In this respect, security is closer to software fault-tolerance, an area in the dependability domain for which coverage calculations are also very hard.
An aspect of the coverage problem is the question of how to accurately model attacks. Besides the lack of statistical independence, it may also be a bad idea to specify attackers the way it is done in fault-tolerance, by stating what the attacker can do. Rather, an attacker should be specified by trust assumptions, i.e., by what he is assumed not to do. This indicates that assuming an attacker instead of hardware faults is a worse (i.e., more pessimistic) assumption: attackers can do more than hardware faults, since they can perform goal-oriented, intentional actions. For example, deterministic error detection, as in CRCs, is easily defeated by an attacker but not by hardware faults. In this sense, an attacker assumption supersedes a fault assumption in its behavioral aspect (i.e., a fault is an attack but not vice versa).
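The CRC point can be made concrete with a short sketch (illustrative only; the payloads are invented): a deterministic checksum reliably catches a random bit flip, but an attacker who can rewrite the message simply recomputes the checksum over the tampered data.

```python
# Illustration: CRC detects random faults but not intentional tampering.
# Payload contents are invented for illustration.

import zlib

def frame(data: bytes) -> bytes:
    """Append a CRC-32 checksum to the payload."""
    return data + zlib.crc32(data).to_bytes(4, "big")

def check(framed: bytes) -> bool:
    """Verify that the payload matches its appended checksum."""
    data, crc = framed[:-4], framed[-4:]
    return zlib.crc32(data).to_bytes(4, "big") == crc

good = frame(b"transfer 10 EUR")
assert check(good)

# A random hardware fault flips a bit somewhere; the CRC catches it:
faulty = bytes([good[0] ^ 0x01]) + good[1:]
assert not check(faulty)

# An attacker rewrites the payload AND recomputes the checksum;
# the tampering goes undetected:
forged = frame(b"transfer 10000 EUR")
assert check(forged)
```

This is exactly the asymmetry described above: against random faults the detection coverage of a CRC is well understood, while against a goal-oriented adversary it is zero, which is why security requires keyed mechanisms instead.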
The secrecy problem. In contrast to fault-tolerance, systems in security are often concerned with keeping things secret. Keeping things secret basically means that ``an attacker'' cannot infer anything about them. This assumes that there is some value in knowing data. Hence, the goal of secrecy also implies an adversarial notion which is fundamentally different from the faults which are considered in fault-tolerance. It is also well-known that secrecy properties are fundamentally different from other properties like safety and timeliness/liveness because they don't refine as nicely as the other properties: replacing a subcomponent by a different one with exactly the same input/output specification can make a system insecure with respect to secrecy. Therefore, new engineering approaches are necessary if secrecy is a goal.
To achieve secrecy, system engineers must therefore also employ new means which are not considered useful in fault-tolerance: these means come from the area of cryptography. There have been uses of cryptographic primitives in the area of fault-tolerance (the written messages algorithm of the Byzantine Generals is maybe the most famous), but in these contexts cryptographic primitives have been treated as black boxes with 100% security guarantees (e.g., encryption cannot be broken without the key). This however is often not justifiable in practice, firstly because of the refinement problem mentioned above, secondly because there are no 100% guarantees in practice, or at least the coverage of the assumption of perfect security is hard to measure (the coverage problem).
The organizers also wish to thank the contributing authors, the invited speakers, and the panelists for their valuable contributions to this event. Vital support by the DSN organizing committee, especially the workshop chair Neeraj Suri, and by the Deutsche Forschungsgemeinschaft (DFG) is also gratefully acknowledged.
This report, the slides of the invited presentations, and additional material can be found on the DSN PoDSy Website at: http://lpdwww.epfl.ch/fgaertner/podsy2003/