
CrossTalk - The Journal of Defense Software Engineering
Oct 2000 Issue

The Survivability Imperative: Protecting Critical Systems
Robert J. Ellison, Software Engineering Institute
Nancy R. Mead, Software Engineering Institute
Thomas A. Longstaff, Software Engineering Institute
Richard C. Linger, Software Engineering Institute

The success of virtually all organizations in defense, government, and business depends on the availability and correct functioning of large-scale networked information systems of remarkable complexity. Because of the severe consequences of failure, organizations are focusing on system survivability as a key risk management strategy. The Survivable Network Analysis (SNA) method provides a systematic means to assess and improve system survivability for risk reduction. Survivability can also be integrated into requirements definition for new or evolving systems.

Progress Demands System Survivability

Modern society is increasingly dependent upon large-scale, highly distributed systems that operate in unbounded network environments. Such systems improve efficiency by permitting entire new levels of organizational integration, but they also introduce elevated risks of intrusion and compromise. These risks can be mitigated within the organization's system by incorporating survivability capabilities.

Unbounded networks such as the Internet have no central administrative control and no unified security policy. Furthermore, the number and nature of nodes connected to such networks cannot be fully known. Despite the best efforts of security practitioners, no amount of hardening can assure that a system connected to an unbounded network will be invulnerable to attack.

The discipline of survivability can help ensure that systems can deliver essential services and maintain essential properties including integrity, confidentiality, and performance despite the presence of intrusions. Unlike traditional security measures, which often depend on central control and administration, survivability is intended to address network environments where such capabilities may not exist.

Glossary of survivability terms

Survivability is defined as the capability of a system to fulfill its mission in a timely manner, even in the presence of attacks, failures, or accidents. As an emerging discipline, survivability builds on related fields of study, including security, fault tolerance, safety, reliability, reuse, performance, verification, and testing; moreover, it introduces new concepts and principles [1, 2, 3, 4, 5]. Survivability focuses on preserving essential services in unbounded environments, even when systems are penetrated and compromised.

In defining survivability, the term mission refers to high-level organizational objectives. Missions are not limited to military settings; any successful organization or project must have a vision of its objectives, whether expressed implicitly or as a formal mission statement. Judgments about mission fulfillment are typically made in the context of external conditions that affect achievement of mission objectives.

For example, a financial system may shut down for 12 hours during a period of widespread power outages caused by a hurricane. If the system preserves integrity and confidentiality of data and resumes essential services following the period of downtime, it can reasonably be judged to have fulfilled its mission. However, if the system shuts down unexpectedly for 12 hours under normal conditions or minor environmental stress and deprives users of essential financial services, it can be judged to have failed its mission, even if integrity and confidentiality are preserved.

Timeliness is typically a critical factor in mission objectives, and is explicitly included in the definition of survivability. The terms attack, failure, and accident include all potentially damaging events; however, these terms do not partition events into mutually exclusive or even distinguishable sets. It is often difficult to determine if a particular detrimental event is the result of a malicious attack, a component failure, or an accident. Even if the cause is eventually determined, the critical immediate response cannot depend on speculations about the cause.

Attacks are potentially damaging events orchestrated by an intelligent adversary. Attacks include intrusions, probes, and denials of service. Moreover, the threat of an attack can have as severe an impact on a system as an actual occurrence. A system that assumes an overly defensive position because of an attack threat may significantly reduce functionality and divert excessive resources to monitoring the environment and protecting system assets.

Failures are potentially damaging events caused by deficiencies in a system or in an external element upon which the system depends. Failures may be due to software design errors, hardware degradation, human errors, or corrupted data.

Accidents describe a broad range of randomly occurring and potentially damaging events, such as natural disasters, that usually originate outside a system.

With respect to survivability, a distinction between an attack and failure or accident is less important than the impact of the event. It is often not possible to distinguish between intelligently orchestrated attacks and unintentional or random detrimental events. Survivability concentrates on the effect of a potentially damaging event. For a system to survive, it must recover from a damaging effect long before the underlying cause is identified. In fact, recovery must be successful whether or not the cause is ever determined.

It is important to recognize that it is mission fulfillment that must survive, not any particular subsystem or component. The core concept of survivability is the capability of a system to fulfill its mission, even if significant portions of the system are damaged or destroyed.

Survivable Network Analysis

The SNA method depicted in Figure 1 was developed by the SEI Computer Emergency Response Team (CERT) Coordination Center as a practical engineering process for systematic assessment of survivability properties of proposed systems, existing systems, and modifications to existing systems [6, 7]. SNA is carried out at the architecture level as a cooperative project by an SEI team working with system architects, developers, and stakeholders. The method proceeds through a series of joint working sessions, culminating in a briefing on findings and recommendations. In this article, the focus is on attacks, although the trace-based, compositional SNA method applies to analysis of failures and accidents as well.

Figure 1: The survivable network analysis method

The SNA method provides a means for organizations to understand survivability in the context of their operating environments. What functions must survive? What intrusions could occur? How could intrusions affect survivability? What are the risks to the mission? How could architecture modifications reduce the risks? Systematic consideration of these questions through SNA reveals the risks and leads to mitigation strategies. Steps in the SNA method are defined as follows:

Step One: System Definition

The first step focuses on understanding mission objectives, requirements for the current or candidate system, structure and properties of the system architecture, and risks in the operational environment.

Step Two: Essential Capability Definition

Once step one is complete, essential services (services that must be maintained during attack) and essential assets (assets whose integrity, confidentiality, availability, and other properties must be maintained during attack) are identified, based on mission objectives and the consequences of failure. Essential service and asset uses are characterized by usage scenarios, which are traced through the architecture to identify essential components whose survivability must be ensured.

Step Three: Compromisable Capability Definition

Next, intrusion scenarios are selected based on assessment of environmental risks and intruder capabilities. These scenarios are likewise mapped onto the architecture as execution traces to identify corresponding compromisable components (components that could be penetrated and damaged by intrusion). In essence, intruders are treated as simply another class of users, and the design task for intrusion usage is to make it as difficult and costly as possible.

Step Four: Survivability Analysis

The final step of the SNA method takes aim at soft spot components of the architecture. These are components that prove both essential and compromisable, based on the results of steps two and three. Soft spot components and supporting architecture are then analyzed for the key survivability properties of resistance, recognition, and recovery (the three Rs), as well as for adaptation and evolution.

Resistance is the capability of a system to repel attacks. Recognition is the system's capability to detect attacks as they occur and to evaluate the extent of damage and compromise. Recovery, a hallmark of survivability, is the capability to maintain essential services and assets during attack, limit the extent of damage, and restore full services following attack. Table 1 depicts some strategies for improving survivability.
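The trace-based identification of soft spots in steps two through four can be pictured with a small sketch. It assumes each scenario has already been traced to the list of components it touches; the component names, scenario names, and the helper functions `components_touched` and `soft_spots` are hypothetical and are not part of the SNA method as published.

```python
from typing import Dict, List, Set

# Hypothetical architecture traces: each scenario lists the components it touches.
# Component and scenario names are illustrative, not taken from the article.
USAGE_TRACES: Dict[str, List[str]] = {
    "submit_order": ["web_server", "app_server", "order_db"],
    "check_status": ["web_server", "app_server", "order_db"],
}
INTRUSION_TRACES: Dict[str, List[str]] = {
    "sql_injection": ["web_server", "order_db"],
    "mail_relay_abuse": ["mail_gateway"],
}


def components_touched(traces: Dict[str, List[str]]) -> Set[str]:
    """Return every component that appears in at least one scenario trace."""
    return {component for trace in traces.values() for component in trace}


def soft_spots(usage: Dict[str, List[str]],
               intrusions: Dict[str, List[str]]) -> Set[str]:
    """Components that are both essential (step two) and compromisable (step three)."""
    return components_touched(usage) & components_touched(intrusions)


print(sorted(soft_spots(USAGE_TRACES, INTRUSION_TRACES)))
# ['order_db', 'web_server']
```

In this toy data the mail gateway is compromisable but not essential, and the application server is essential but not reached by any selected intrusion scenario, so neither is a soft spot.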

Table 1: Some strategies for improving system survivability

The analysis of the three Rs is summarized in a Survivability Map as depicted in Figure 2. The map enumerates, for every intrusion scenario and its corresponding soft spot effects, the current and recommended architecture strategies for resistance, recognition, and recovery. The Survivability Map provides feedback about the original architecture and system requirements, and gives management a roadmap for survivability evaluation and improvement. In addition, survivability analysis often results in recommendations for security and survivability policy definition or modification.

The SNA method has been applied to a number of systems with good results. Customers have benefited from survivability improvements to system architectures, as well as from clarified requirements and early problem identification. Survivability is also the subject of ongoing research, as described, for example, in Fisher [8].
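One way to picture a row of the Survivability Map is as a small record holding the current and recommended strategy for each of the three properties. The sketch below is an assumption about how such a map might be held in code; the class names `Strategy` and `MapRow`, the field names, and the sample entries are hypothetical and do not reproduce the format shown in Figure 2.

```python
from dataclasses import dataclass, field
from typing import Dict, List

PROPERTIES = ("resistance", "recognition", "recovery")


@dataclass
class Strategy:
    current: str       # what the architecture does today
    recommended: str   # what the analysis proposes


@dataclass
class MapRow:
    intrusion_scenario: str
    soft_spot_effects: str
    strategies: Dict[str, Strategy] = field(default_factory=dict)

    def missing_properties(self) -> List[str]:
        """List the properties (of the three Rs) this row does not yet cover."""
        return [p for p in PROPERTIES if p not in self.strategies]


# Hypothetical sample row; the scenario and strategies are illustrative only.
row = MapRow(
    intrusion_scenario="Intruder replays a stolen session token",
    soft_spot_effects="order_db records may be altered",
    strategies={
        "resistance": Strategy("password login", "add short-lived tokens"),
        "recognition": Strategy("none", "log and alert on token reuse"),
    },
)
print(row.missing_properties())  # ['recovery']
```

A check like `missing_properties` is only a convenience for spotting rows where the analysis has not yet recorded a strategy for one of the three properties.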

Figure 2: Sample Survivability map format

Adding Survivability to System Requirements

Survivability properties can also be integrated into the requirements definition for new or evolving systems [9]. Figure 3 depicts an iterative model for defining survivable system requirements. Survivability must address not only requirements for software functionality, but also requirements for software usage, development, operation, and evolution. Thus, five specific types of requirements definitions are relevant to survivable systems in the model of Figure 3, as discussed below.

Figure 3: Requirements definition for survivable systems

System/Survivability Requirements

In this discussion, system requirements refers to traditional user functions that a system must provide. For example, a network management system must provide user functions for monitoring network operations, adjusting performance parameters, and so forth. System requirements also include nonfunctional aspects, such as timing, performance, and reliability. Survivability requirements refer to system capabilities for the delivery of essential services in the presence of attacks and intrusions, and recovery of full services.

Survivability requires that system requirements be organized into essential services and non-essential services, perhaps in terms of user categories or business criticality. Essential services must be maintained even during successful intrusions; non-essential services are to be recovered after intrusions have been dealt with. Essential services may be further stratified into levels, each embodying fewer and more vital services as a function of increasing severity and duration of intrusion. It is also possible that the set of essential services may vary in a more dynamic manner depending on a particular attack scenario and the resulting situation. In this case, services that are essential under one scenario may not be essential under another, resulting in different combinations of essential services that are scenario-dependent.
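The stratification of essential services into levels can be illustrated with a short sketch. It assumes a simple integer severity scale and tags each service with the highest severity at which it must still be delivered; the scale, the service names, and the function `services_to_maintain` are hypothetical.

```python
from typing import Dict, List

# Hypothetical services, each tagged with the highest severity level at which
# it must still be delivered (0 = normal operation only, i.e. non-essential).
SERVICE_MAX_SEVERITY: Dict[str, int] = {
    "usage_reports": 0,
    "performance_tuning": 1,
    "network_monitoring": 2,
    "emergency_shutdown": 3,
}


def services_to_maintain(severity: int) -> List[str]:
    """Return the services that must stay up at the given intrusion severity."""
    return sorted(name for name, max_sev in SERVICE_MAX_SEVERITY.items()
                  if max_sev >= severity)


for level in range(4):
    print(level, services_to_maintain(level))
```

Each higher severity level yields a subset of the level below it, which matches the idea of fewer and more vital services as intrusions become more severe. A scenario-dependent variant would replace the single integer with a table keyed by attack scenario.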

Thus, definitions of requirements for essential services must be augmented with appropriate survivability requirements. As shown in Figure 3, survivable systems may also include legacy and COTS components not originally developed with survivability as an explicit objective. Such components may provide both essential and non-essential services and may engender special functional requirements for isolation and control through wrappers and filters to help permit safe use in a survivable system environment.

Beyond functional requirements, survivability itself imposes new types of requirements on systems for resistance to, recognition of, and, in particular, recovery from intrusions and compromises. A variety of existing and emerging survivability strategies, noted in Table 1, support these survivability requirements.

Survivable systems are envisioned as capable of adapting their behavior, function, and resource allocation in response to intrusions. When necessary, for example, functions and resources devoted to non-essential services could be reallocated to the delivery of essential services and intrusion resistance, recognition, and recovery. Requirements for such systems must specify the behavior for adaptation and reconfiguration in response to intrusions.
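The reallocation idea can be sketched as a priority rule: when capacity shrinks, essential services are served first and non-essential services receive whatever remains. This is a minimal illustration under assumed names and numbers; the `Service` class, the abstract resource units, and the `allocate` function are hypothetical, and a real system would need a far richer reconfiguration policy.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Service:
    name: str
    essential: bool
    demand: int  # resource units needed for full service (hypothetical unit)


def allocate(services: List[Service], capacity: int) -> Dict[str, int]:
    """Grant capacity to essential services first, then to the rest."""
    allocation: Dict[str, int] = {}
    # sorted() is stable, so services keep their listed order within each group.
    for svc in sorted(services, key=lambda s: not s.essential):
        granted = min(svc.demand, capacity)
        allocation[svc.name] = granted
        capacity -= granted
    return allocation


services = [
    Service("reporting", essential=False, demand=40),
    Service("transaction_processing", essential=True, demand=50),
    Service("intrusion_monitoring", essential=True, demand=20),
]
print(allocate(services, capacity=110))  # normal operation: every demand is met
print(allocate(services, capacity=80))   # reduced capacity: reporting is cut to 10
```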

Systems can exhibit large variations in survivability requirements. Small local networks may have few or even no essential services with acceptable manual recovery times measured in hours. Large-scale networks of networks may be required to maintain a core set of essential services with automated intrusion detection and recovery times measured in minutes. Embedded command and control systems may require essential services to be maintained in real time, with recovery periods measured in milliseconds. Attainment and maintenance of survivability consumes resources in system development, operation, and evolution. Survivability requirements for a system should be based on costs and risks to an organization associated with loss of essential services.

Usage/Intrusion Requirements

Survivable system testing must demonstrate the performance of essential and nonessential system services, as well as the survivability of essential services during an intrusion. Because system performance in testing (and operation) depends totally on the usage to which it is subjected, an effective approach to survivable system testing is based on usage scenarios derived from usage models.

Usage models are developed from usage requirements, which specify legitimate usage environments and all possible usage scenarios. Usage requirements for essential and nonessential services must be defined in parallel with system and survivability requirements. Furthermore, intrusion usage must be treated on a par with legitimate usage: intrusion requirements, which specify intrusion usage environments and all possible scenarios of intrusion use, must be defined as well. In this approach, intrusion usage is modeled in conjunction with the legitimate use of system services. Intruders may engage in usage scenarios beyond legitimate scenarios, but may also employ legitimate usage for purposes of intrusion if they become privileged to do so.
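A usage model that covers legitimate and intrusion usage together might be sketched as a small probabilistic state machine from which test scenarios are sampled. The article does not prescribe a model form; the Markov-style table, the state names, the probabilities, and the functions `generate_scenario` and `is_intrusion` below are all assumptions made for illustration.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical usage model: state -> list of (next_state, probability).
# Intruder transitions sit alongside legitimate ones in the same model.
USAGE_MODEL: Dict[str, List[Tuple[str, float]]] = {
    "start": [("login", 0.9), ("port_scan", 0.1)],
    "login": [("query_account", 1.0)],
    "port_scan": [("guess_password", 1.0)],
    "guess_password": [("query_account", 0.5), ("end", 0.5)],
    "query_account": [("end", 1.0)],
}
INTRUSION_STATES = {"port_scan", "guess_password"}


def generate_scenario(rng: random.Random) -> List[str]:
    """Walk the model from 'start' to 'end', returning the visited steps."""
    state, steps = "start", []
    while state != "end":
        next_states, weights = zip(*USAGE_MODEL[state])
        state = rng.choices(next_states, weights=weights, k=1)[0]
        if state != "end":
            steps.append(state)
    return steps


def is_intrusion(scenario: List[str]) -> bool:
    """A scenario counts as intrusion usage if it visits any intrusion state."""
    return any(step in INTRUSION_STATES for step in scenario)


rng = random.Random(0)  # fixed seed so a test run is repeatable
scenarios = [generate_scenario(rng) for _ in range(5)]
for scenario in scenarios:
    print("intrusion" if is_intrusion(scenario) else "legitimate", scenario)
```

Note that an intruder path can end in `query_account`, a legitimate step, which mirrors the point that intruders may employ legitimate usage once they become privileged.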

Development Requirements

Survivability places stringent requirements on system development and testing practices. Software errors can have a devastating effect on survivability and provide ready opportunities for intruder exploitation. Sound engineering practices are required to create survivable software. The following five principles (four technical and one organizational) are example requirements for survivable system development and testing practices:

  • Precisely specify required functions in all possible circumstances of use.
  • Verify correct implementations with respect to function specifications.
  • Specify function usage in all possible circumstances of use, including intruder usage.
  • Test and certify based on function usage and statistical methods.
  • Establish permanent readiness teams for system monitoring, adaptation, and evolution.

Sound engineering practices are required to deal with legacy and COTS software components as well.

Operations Requirements

Survivability also places demands on system operation and administration: to define and administer survivability policies, monitor system usage, respond to intrusions, and evolve system functions as necessary to ensure survivability as usage environments and intrusion patterns change over time.

Evolution Requirements

Lastly, system evolution is an inevitable necessity in response to users' requirements for new functions and intruders' increasing knowledge of system behavior and structure. In particular, survivability requires that system capabilities evolve more rapidly than intruder knowledge. This prevents the accumulation of information about invariant system behavior and structure needed for an intruder to achieve successful penetration and exploitation.

Summary

The emerging discipline of survivable systems is directed at maintaining essential mission operations in adverse circumstances that no amount of security precautions can guarantee to prevent. System survivability can be investigated and improved through the SNA method, and survivability can be integrated into system requirements on a par with functionality and performance. Survivability analysis is a prudent risk management technique in a world of increasing dependency on complex, large-scale network systems.

References

  1. Lipson, H.F. and Fisher, D.A. Survivability-A New Technical and Business Perspective on Security, Proceedings of the New Security Paradigms Workshop, IEEE Computer Society Press, 1999.
  2. Presidential Commission on Critical Infrastructure Protection, Critical Foundations: Protecting America's Infrastructures, The Report of the Presidential Commission on Critical Infrastructure Protection, October 1997, p. 173. Available at www.pccip.gov
  3. DARPA Information Survivability Program. Available at www.darpa.mil/ito/research/is
  4. Proceedings of the 1997 Information Survivability Workshop, San Diego, Calif., Feb. 12-13, 1997, SEI and IEEE Computer Society, April 1997. Available at www.cert.org/research
  5. Proceedings of the 1998 Information Survivability Workshop, Orlando, Fla., Oct. 28-30, 1998, SEI and IEEE Computer Society, 1998. Available at www.cert.org/research
  6. Ellison, R.J., Linger, R.C., Longstaff, T., and Mead, N.R. Survivable Network Systems Analysis: A Case Study, IEEE Software, July/August 1999, pp. 70-77.
  7. Ellison, R.J., Fisher, D.A., Linger, R.C., Lipson, H.F., Longstaff, T.A., and Mead, N.R. Survivability: Protecting Your Critical Systems, IEEE Internet Computing, November/December 1999.
  8. Fisher, D.A. and Lipson, H.F. Emergent Algorithms-A New Method for Enhancing Survivability in Unbounded Systems, Proceedings of the 32nd Annual Hawaii International Conference on System Sciences, Maui, Hawaii, Jan. 5-8, 1999 (HICSS-32), IEEE Computer Society, 1999.
  9. Linger, R.C., Mead, N.R., and Lipson, H.F. Requirements Definition for Survivable Network Systems, Proceedings of International Conference on Requirements Engineering, IEEE Computer Society Press, Los Alamitos, Calif., 1998, pp. 14-23.


About the Authors
Robert J. Ellison

Robert Ellison, Ph.D. is a Senior Member of the Technical Staff in the SEI Networked Systems Survivability Program. His research interests include system survivability and architectural patterns and styles for security architectures. He has a doctorate in Mathematics from Purdue University and is a member of the ACM and IEEE Computer Society.



Nancy R. Mead

Nancy Mead, Ph.D. is a senior member of the technical staff in the SEI Networked Systems Survivability Program, and a faculty member in software engineering at Carnegie Mellon University. She is involved in the study of survivable systems requirements and architectures, and the development of professional infrastructure for software engineers. Prior to joining the SEI, Mead was a senior technical staff member at IBM Federal Systems, where she spent most of her career in development and management of large real-time systems. She also worked in IBM's software engineering technology area, and managed IBM Federal Systems' software engineering education department.



Thomas A. Longstaff

Dr. Thomas Longstaff is a senior member of the technical staff at the Software Engineering Institute and currently manages research and development in Survivable Network Technology for the Networked Systems Survivability Program. He is a member of the CERT® Coordination Center and conducts analysis of vulnerability and security incidents and methods for assessing survivability. Previously he was technical director at the Computer Incident Advisory Capability at Lawrence Livermore National Laboratory, Livermore, Calif.



Richard C. Linger

Richard Linger is a senior member of the technical staff at the Software Engineering Institute's CERT® Coordination Center at Carnegie Mellon University. He teaches at the CMU H.J. Heinz School of Public Policy and Management. While at IBM, he founded and managed the IBM Cleanroom Software Technology Center. Linger has published three software engineering textbooks and more than 50 articles. He holds a bachelor's degree in electrical engineering from Duke University. He is a member of the IEEE, the ACM, and the National Software Council, and is vice president of the Center for National Software Studies.

Software Engineering Institute
4500 Fifth Avenue
Pittsburgh, PA 15213


