August 13, 2003
(last updated September 11, 2003)
The workshop on Principles of Dependable Systems (PoDSy) was held on June 24, 2003 in conjunction with the International Conference on Dependable Systems and Networks (DSN 2003) at the Cathedral Hill Hotel in San Francisco, California. The event attracted around 80 participants and was praised by attendees and the DSN organizers for its highly relevant choice of topics, quality reviewing, and excellent opportunities to learn about and help shape new ideas. This document briefly recalls the motivation and background of PoDSy, then reports on the workshop itself, and finally concludes with some preliminary findings and results from the workshop.
The aim of PoDSy was to bring together researchers and practitioners from both the fault-tolerance and security communities to discuss foundational topics (and related applied experiences) on the similarities and differences between both areas. The workshop was jointly organized by Felix Gärtner (EPFL, Switzerland), Klaus Kursawe (IBM Zurich Research Laboratory, Switzerland), and Levente Buttyan (Budapest University of Technology and Economics, Hungary). Felix and Klaus were both present at the workshop.
Starting with the similarities, Paolo reported that the sets of desired properties and their impairments are very similar in fault tolerance and security. Properties can be specified as a set of safety and liveness properties, while faults and attacks can be regarded as actions that aim to violate these properties. Hence, models and techniques from fault-tolerance can also be applied to achieve security. One of the main thrusts emerging from this perspective is the paradigm shift in security away from pure attack prevention towards a form of attack tolerance, a point also made in Cathy Meadows' talk.
A couple of points were put into similar perspective during the talk.
The notion of an attack was contrasted with a fault using the AVI model: in this model, a fault is equivalent to an intrusion, and the sources of an intrusion are a vulnerability plus an attack (Attack + Vulnerability = Intrusion, AVI). In contrast to faults, it is hard to estimate the coverage of ``attacker assumptions'' (this was called the coverage problem). Paolo proposed to regard the level of trustworthiness as a measure for placing reliance on secure systems: in secure systems, component A often has to trust component B to achieve security. The level of trustworthiness is a measure similar to assumption coverage in fault tolerance: it is the probability that the trust placed in a component is justified. Trustworthiness can be increased through cryptographic protocols, tamper-proof hardware, etc. As in fault tolerance, the coverage is never 1 for real systems.
Fig. 1: Paolo Verissimo during his DSN PoDSy talk.
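The probabilistic reading of trustworthiness above can be made concrete with a toy sketch (not from the talk; the numbers and the independence assumption are invented for illustration): if trust assumptions were statistically independent, the coverage of a chain of trusted components would be the product of the individual coverages, which is never 1 and shrinks as the chain grows.

```python
# Toy illustration: trustworthiness as a coverage probability.
# Assumes (unrealistically) independent trust assumptions; all
# component coverage values below are invented.

def chain_coverage(coverages):
    """Probability that every trust assumption in the chain holds."""
    result = 1.0
    for c in coverages:
        result *= c
    return result

# Even highly trustworthy components compound to a lower system figure:
system = chain_coverage([0.99, 0.95, 0.98])
print(system)  # a value below each individual coverage
```

The point of the sketch is only the compounding effect; whether any statistical independence between trust assumptions is justified is exactly the coverage problem the talk raised.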
Paolo finally spoke about the issues of fault and timing assumptions in secure/intrusion-tolerant systems. He described two extreme perspectives: On the one hand, people tend to make worst case assumptions about everything, e.g., take the Byzantine/arbitrary failure model and make no timing assumptions at all (asynchronous system assumption). This is the path often followed by security/cryptography people. However, worst case assumptions usually conflict with known impossibility results or lead to very expensive protocols that run very inefficiently in most practical situations. On the other hand, people coming from the area of fault-tolerance often make restricted fault assumptions about malicious components and assume at least some weak timeliness guarantees (e.g., partial synchrony). In this stream, it is hard to come up with "good" behavior restrictions for compromised components, and timing guarantees are usually the first to be violated in the presence of attacks (the coverage problem again).
Paolo argued for a notion he called architectural hybridization: Hybrid fault models, i.e., combinations of arbitrary and more restricted faults, deserve more consideration. The hybrid fault assumption can be enforced by the system architecture, i.e., by providing components with different levels of trustworthiness (strong and weak). Strong components, e.g., tamper-proof processing environments, can be employed strategically within the system where necessary. Similarly, regarding timeliness assumptions, he advocated the idea of wormholes, timely and secure computing channels, and presented architectural designs to build them. The design principle is to have the system execute the security-critical operations with high confidence in a small, highly secured subsystem. A few case studies from the MAFTIA project applying these principles were presented.
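As a purely illustrative sketch of this design principle (class names and the attestation operation are invented here, not MAFTIA's actual design): only a small security-critical step is delegated to a trusted "wormhole" subsystem, while the bulk of the work stays in the large, possibly compromised part.

```python
# Illustrative sketch of architectural hybridization; all names and
# operations are hypothetical, not taken from MAFTIA.

class TrustedSubsystem:
    """Stand-in for a small, tamper-proof, timely 'wormhole'."""
    def __init__(self):
        self._counter = 0
    def attest(self, digest: bytes):
        # The trusted part vouches only for a small digest, stamping it
        # with a monotonic counter that a compromised host cannot roll back.
        self._counter += 1
        return (digest, self._counter)

class UntrustedHost:
    """Large untrusted part: does the heavy work, then asks the trusted
    subsystem to vouch for a short summary of it."""
    def __init__(self, wormhole: TrustedSubsystem):
        self.wormhole = wormhole
    def process(self, payload: bytes):
        digest = bytes([sum(payload) % 256])  # toy digest of the bulk work
        return self.wormhole.attest(digest)

host = UntrustedHost(TrustedSubsystem())
d1 = host.process(b"large request one")
d2 = host.process(b"large request two")
assert d2[1] > d1[1]  # attestations are ordered by the trusted counter
```

The design choice mirrored here is the one from the talk: keep the trusted component small and its interface narrow, so that strong assumptions need to hold only for it.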
Talk slides are available at http://lpdwww.epfl.ch/fgaertner/podsy2003/verissimo-podsy.pdf
One obvious approach to dealing with these difficulties is to ``ignore'', or better: abstract out, the cryptography. This is done in many approaches to cryptographic protocol design, for example in the well-known work of Dolev and Yao and the Spi Calculus of Abadi and Gordon. This approach simplifies the matter to such an extent that automatic analysis by tools becomes feasible. However, the assumption underlying this approach is that no new insecurities are introduced when the abstract cryptography is implemented with ``real'' cryptography.
The other approach is what Ran called the ``traditional cryptographic approach''. In this approach, basically nothing is abstracted away; the whole protocol is the focus of analysis, down to the number-theoretic properties of the encryption scheme used. Obviously, analysis in this approach achieves a higher level of confidence in the security of the system. But the proofs are not easily automatable and usually remain a hard, error-prone undertaking.
Ideally, it would be nice to combine both approaches in a modular way: first analyze the system using established tools with abstracted ``black box'' cryptographic primitives, then analyze the cryptographic primitives separately using the traditional approach, and finally combine them into an overall secure system. The only thing needed in this approach is a theorem which states: (1) if the abstract system is secure with black box cryptography, and (2) if the implementation of the cryptography is secure, then (3) the combination of both is also secure. Engineers can then analyze the building blocks separately and later put them together, relieved of both complex monolithic proofs and oversimplified results.
Ran presented the foundations for a theory that gives us such composition theorems. The definition of security underlying this theory is that of the area of secure multiparty computation: a secure system is defined using some 100% trusted service (such as a 100% trusted and available trusted third party). A real system is defined to be secure if it can ``emulate'' the ideal system. Emulation is defined probabilistically, i.e., both systems are indistinguishable with very high probability. His composition theorem even allows a more modular design of the security protocols themselves. This idea has been applied to many existing protocols such as digital signatures, key exchange, and secure communication. He noted that some of the case studies were performed by Michael Waidner's group at IBM Zurich, where a similar approach is pursued. What is still lacking are tools which can automate this approach, or at least support it to make parts more manageable.
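The engineering idea behind such composition theorems can be sketched in code (this is only a shape sketch, not Canetti's formal framework; all class and function names are hypothetical, and the "real" cipher is a trivial stand-in): the protocol is written once against an abstract ideal primitive, and a concrete implementation that emulates it is substituted later.

```python
# Shape sketch of modular protocol analysis; not a real framework.

class IdealChannel:
    """Black-box secure channel: the abstraction a tool would analyze."""
    def send(self, msg):
        return ("opaque", msg)   # idealized: the adversary learns nothing
    def recv(self, pkt):
        return pkt[1]

class RealChannel:
    """Concrete stand-in (a toy XOR 'cipher', purely to show the shape)."""
    def __init__(self, key: int):
        self.key = key
    def send(self, msg: bytes):
        return ("ct", bytes(b ^ self.key for b in msg))
    def recv(self, pkt):
        return bytes(b ^ self.key for b in pkt[1])

def echo_protocol(channel, msg):
    """Protocol written once against the channel interface."""
    return channel.recv(channel.send(msg))

# The composition theorem's promise: if echo_protocol is secure with
# IdealChannel, and RealChannel securely emulates IdealChannel, then
# the combination is secure as well.
assert echo_protocol(IdealChannel(), b"hi") == b"hi"
assert echo_protocol(RealChannel(7), b"hi") == b"hi"
```

The division of labor matches the text: tools analyze `echo_protocol` against the ideal primitive, cryptographers analyze the emulation claim, and the theorem glues the two results together.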
Talk slides are available at http://lpdwww.epfl.ch/fgaertner/podsy2003/canetti-podsy2003.pdf
Fig. 2: Ran Canetti at DSN PoDSy.
The first point Cathy made concerned the worst case assumptions usually employed in security. Worst case assumptions give strong theorems but often yield solutions which are impractical to use in reality. These limitations are well-known today, and Cathy takes this as a positive sign of a maturing research field. As an example, she presented the problem of information flow analysis in multi-level security systems. While in earlier work, theories and techniques were established to build systems which were invulnerable to this type of problem (the work on non-interference), today people are looking into measuring the amount of information flowing over a covert channel when the worst case assumptions are (realistically) relaxed.
The second point was that security research is broadening its scope. While in earlier days the focus was very much on secrecy, implemented by ``perfect'' access control mechanisms after eliminating all covert channels, the access control model was not very helpful when people started looking at other security properties like integrity or availability: to maintain integrity, an access control mechanism must not only control who writes data but also what is written; to prevent denial of service (and maintain availability), an access control mechanism must first solve the problem of identifying who is requesting the service.
Fig. 3: Catherine Meadows revisits a 1995 paper at DSN PoDSy.
Today, we are seeing a more flexible use of attack (=fault) assumptions in security, depending on what situation we are trying to defend against. There can be faults in the security mechanism, the usual hostile attacks on the system, but also misuse made possible by insecure passwords, incorrect configuration, users opening email attachments from unknown sources, etc. Today, security failures typically result from a combination of multiple system faults. A refined fault model makes it easier to identify these threats. Security has also profited from the dependability view by no longer relying on a single security mechanism. For example, the use of honeypots to distract intruders basically acknowledges that the first defense perimeter can fail, but provides backup mechanisms to observe and identify the attacker.
According to Cathy, the analog of fault forecasting, i.e., measuring security, is still very much an open problem. Also, the effects on reliability of including security in a fault-tolerant system are not very clear. Analyzing security in a dynamic environment with attackers that change their abilities over time is also an area worth studying more.
Finally, Cathy returned to the 21st century and reviewed two new security paradigms which have evolved since: intrusion tolerance and survivability, both of which had an independent talk at PoDSy (see Paolo Verissimo's talk for intrusion tolerance and John Knight's talk for survivability). She closed by stating that fault forecasting is the only item from the dependability taxonomy which has not been added to the security toolbox, a problem which seems to be much harder in security than in dependability.
Talk slides are available at http://lpdwww.epfl.ch/fgaertner/podsy2003/meadows-podsy2003.pdf
Fig 4: John Knight during the opening of his DSN PoDSy talk.
Informally, survivability can be thought of as the property of maintaining a certain system objective in the presence of attacks or other traumas, including physical damage to equipment and random failures. This system objective need not be identical to the original system objective (like maintaining full communication facilities between all commanders on a battlefield); the system objective may change according to the severity or type of the trauma and the prevailing conditions. If the original system behavior is to calculate a function f(), then in the presence of certain traumas it might be required to compute a function g() which might be totally different from f(). This makes survivability different from masking fault-tolerance, where f() is unchanged, and also from graceful degradation, where g() must be some definable subset of f(). As an example, depending on the trauma, a command system might wish to maintain only communication between battlefield commanders and the commander-in-chief, no longer between the commanders themselves. In a different scenario (i.e., with a different trauma), only local communication within geographical regions may be guaranteed.
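The f()/g() distinction can be sketched in code (a toy illustration; the trauma levels and service functions below are invented, loosely following the battlefield example): the system objective is itself selected as a function of the trauma, rather than being a fixed function or a mechanically degraded version of it.

```python
# Toy sketch of survivability: the objective changes with the trauma.
# Node names, trauma levels, and services are invented for illustration.

def full_service(nodes):
    """f(): every commander can reach every other commander."""
    return {(a, b) for a in nodes for b in nodes if a != b}

def hq_only_service(nodes, hq="HQ"):
    """g(): only links between each commander and the commander-in-chief."""
    return {(n, hq) for n in nodes if n != hq} | \
           {(hq, n) for n in nodes if n != hq}

def surviving_objective(trauma, nodes):
    """Select the system objective according to the trauma."""
    if trauma == "none":
        return full_service(nodes)
    return hq_only_service(nodes)

nodes = ["HQ", "A", "B"]
assert ("A", "B") in surviving_objective("none", nodes)
assert ("A", "B") not in surviving_objective("severe", nodes)
assert ("A", "HQ") in surviving_objective("severe", nodes)
```

The sketch shows only the selection structure; in a real survivable system the hard part is, of course, deciding and enforcing which objective applies under which trauma.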
From its definition, survivability is a very adaptable notion and can be applied to many different settings in practice. It can be applied in circumstances where full masking is infeasible or impossible, like nation-wide command and control systems or in safety critical systems, for which John gave examples.
Talk slides are available at http://lpdwww.epfl.ch/fgaertner/podsy2003/knight-podsy2003.pdf
The first paper was entitled ``Whisper: A Local Secret Maintenance Protocol'' and written by Vinayak Naik, Anish Arora, Sandip Bapat (all from The Ohio State University, USA), and Mohamed Gouda (of The University of Texas at Austin, USA). Vinayak gave the talk and explained the workings of the Whisper protocol, a protocol that provably maintains security and fault-tolerance at the same time. The protocol maintains a shared secret between two nodes which is refreshed in an incremental fashion. The main security properties are forward and backward secrecy, meaning that the protocol maintains or recovers the shared secret even if parts of the secret are learned by an attacker. The fault-tolerance properties involve stabilizing behavior if arbitrary (non-critical) variables are perturbed by faults. The protocol is aimed at the emerging field of sensor networks, networks of very small and computationally bounded devices. Therefore, Whisper uses only computationally inexpensive cryptographic operations (hashes, one-way functions) to achieve security. This is perhaps the most distinguishing feature of the protocol.
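The incremental, hash-based style of refresh can be illustrated with a minimal hash-chain sketch (a hypothetical illustration, not the actual Whisper protocol): each refresh applies a cheap one-way function, so a key captured at some step reveals nothing about earlier keys, and both nodes stay in sync by applying the same steps.

```python
# Hypothetical hash-chain sketch of incremental secret refresh.
# This is NOT the Whisper protocol, only the one-way-function idea
# behind this style of cheap secret maintenance.

import hashlib

def refresh(secret: bytes) -> bytes:
    """One inexpensive, one-way refresh step (suitable for tiny devices)."""
    return hashlib.sha256(secret).digest()

key = b"shared-initial-secret"   # invented placeholder secret
history = [key]
for _ in range(3):
    key = refresh(key)
    history.append(key)

# Inverting SHA-256 is infeasible, so capturing a later key does not
# reveal earlier ones; two nodes applying the same steps stay in sync.
assert len(set(history)) == 4           # every refresh yields a new key
assert history[1] == refresh(history[0])
```

Whisper's actual contribution, per the talk, is stronger: it also recovers the shared secret if parts of it leak and stabilizes after perturbation of non-critical variables, which a bare hash chain does not provide.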
In the second talk, the focus was more on fault-tolerance. The paper was entitled ``Composing Distributed Fault-Tolerance Components'' and was authored by Sandeep Kulkarni, Karun Biyani, and Umamaheswaran Arumugam (all from Michigan State University, USA). Sandeep gave the talk. He described an approach to dynamically replace fault-tolerance components in a distributed system even if different components depend on each other. Replacement is done on-line, i.e., the approach sometimes must wait for certain components to cease interacting with other components (a method of ``freezing on demand'' was presented). The interesting feature of this approach is that it can be used to maintain arbitrary safety properties in the system. This makes it applicable not only to fault-tolerance but also to security, whenever the security property can be formalized as a safety property.
In the final talk, Dominic Duggan presented a paper entitled ``An Architecture for Secure Fault-Tolerant Global Applications'', co-authored by Tom Chothia (both from Stevens Institute of Technology, USA). The approach falls into the class of language-based fault-tolerance and security, meaning that certain basic concepts of dependable computing (like transaction consistency despite failures or secrecy of communication despite eavesdroppers) should be made accessible to the application programmer at different levels using the right abstractions. For example, the high-level design view should not have to deal explicitly with encrypted communication or transaction consistency, while a low-level ``untyped'' implementation view (into which high-level designs are compiled) has direct access to different types of encryption primitives. The aim of this research is ``global computing'', the view that trusted computers in a distributed system like the Internet can share resources and transparently simulate a global computer performing the necessary application tasks. Dominic gave several programming language examples and showed the rules, based on process calculi, for proving refinement correctness.
Fig. 5: Snapshot of the panelists at DSN PoDSy 2003
looking at Yves Deswarte's presentation slides.
The panel began with a round of statements regarding answers to the panel question by all panelists. The final hour of the panel was spent discussing various aspects, including some provocative questions thrown in from the back of the room by Tom Anderson and Brian Randell.
The discussions at the panel including the initial statements were taped,
and a transcription will be made available as an EPFL Technical Report.
The coverage problem. In fault-tolerance, hardware faults are usually modeled as random, statistically independent events. This way of modeling has its roots in a long tradition of reliability engineering and can be backed by empirical measurements. Statistical independence allows one to calculate coverage metrics for these types of faults, yielding a probability that a system will deliver its service in practice. This methodology works no matter how the faults manifest themselves, i.e., they can be simple stuck-at faults or very pessimistic arbitrary (or Byzantine) faults within a given fault region. There is no such thing in security. The problem in security is that it is hard to predict how attackers will operate, even if the vulnerabilities of a certain system are known. It is an open question whether it is justifiable to assume any form of statistical (in)dependence between attacks or to associate any kind of probability with them. In this respect, security is closer to software fault-tolerance, an area in the dependability domain for which coverage calculations are also very hard.
An aspect of the coverage problem is the question of how to accurately model attacks. Besides the lack of statistical independence, it may also be a bad idea to specify attackers the way it is done in fault-tolerance, by stating what the attacker can do. Rather, an attacker should be specified by trust assumptions, i.e., by what he is assumed not to do. This indicates that assuming an attacker instead of hardware faults is a worse (i.e., more pessimistic) assumption: attackers can do more than hardware faults, since they can perform goal-oriented, intentional actions. For example, deterministic error detection, as in CRCs, is easily defeated by an attacker but not by hardware faults. In this sense, an attacker assumption supersedes a fault assumption in its behavioral aspect (i.e., a fault is an attack but not vice versa).
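The CRC point can be made concrete with a short sketch (illustrative only; the payloads are invented): a deterministic checksum reliably catches a random bit flip, but an attacker who can rewrite the message simply recomputes the checksum over the tampered data.

```python
# Illustration: CRC detects random faults but not intentional tampering.
# Payload contents are invented for illustration.

import zlib

def frame(data: bytes) -> bytes:
    """Append a CRC-32 checksum to the payload."""
    return data + zlib.crc32(data).to_bytes(4, "big")

def check(framed: bytes) -> bool:
    """Verify that the payload matches its appended checksum."""
    data, crc = framed[:-4], framed[-4:]
    return zlib.crc32(data).to_bytes(4, "big") == crc

good = frame(b"transfer 10 EUR")
assert check(good)

# A random hardware fault flips a bit somewhere; the CRC catches it:
faulty = bytes([good[0] ^ 0x01]) + good[1:]
assert not check(faulty)

# An attacker rewrites the payload AND recomputes the checksum;
# the tampering goes undetected:
forged = frame(b"transfer 10000 EUR")
assert check(forged)
```

This is exactly the asymmetry described above: against random faults the detection coverage of a CRC is well understood, while against a goal-oriented adversary it is zero, which is why security requires keyed mechanisms instead.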
The secrecy problem. In contrast to fault-tolerance, systems in security are often concerned with keeping things secret. Keeping things secret basically means that ``an attacker'' cannot infer anything about them. This assumes that there is some value in knowing data. Hence, the goal of secrecy also implies an adversarial notion which is fundamentally different from the faults which are considered in fault-tolerance. It is also well-known that secrecy properties are fundamentally different from other properties like safety and timeliness/liveness because they don't refine as nicely as the other properties: replacing a subcomponent by a different one with exactly the same input/output specification can make a system insecure with respect to secrecy. Therefore, new engineering approaches are necessary if secrecy is a goal.
To achieve secrecy, system engineers must therefore also employ new means which are not considered useful in fault-tolerance: these means come from the area of cryptography. There have been uses of cryptographic primitives in the area of fault-tolerance (the written messages algorithm of the Byzantine Generals is maybe the most famous), but in these contexts cryptographic primitives have been treated as black boxes with 100% security guarantees (e.g., encryption cannot be broken without the key). This however is often not justifiable in practice, firstly because of the refinement problem mentioned above, secondly because there are no 100% guarantees in practice, or at least the coverage of the assumption of perfect security is hard to measure (the coverage problem).
The organizers also wish to thank the contributing authors, the invited speakers, and the panelists for their valuable contributions to this event. Vital support by the DSN organizing committee, especially the workshop chair Neeraj Suri, and by the Deutsche Forschungsgemeinschaft (DFG) is also gratefully acknowledged.
This report, the slides of the invited presentations, and additional material can be found on the DSN PoDSy Website at: http://lpdwww.epfl.ch/fgaertner/podsy2003/