Practice: Safety Systems

mbumu@yahoo.com

Contents

 * Abstract
 * Introduction
 * Why Reliable Software May Not Be Safe
 * Safety Specification
 * Safety Assurance
 * Software Safety Paradox
 * Case Studies
 * Conclusion
 * References

Abstract
This paper tackles issues pertaining to safety-critical software. The introduction defines what this software is and sets the boundaries of the area covered. The paper explains why reliable software may not be safe and outlines methods of handling hazards. Specification, an important part of the software life cycle, is discussed, and formal methods are suggested as being suitable for safety-critical software. The need to consider safety at every stage is highlighted. Hazard and risk analyses are considered, with Petri nets discussed at length as a suitable analysis method. Safety assurance is considered together with the safety-argument concept. Process assurance is also considered, on the premise that a high-quality process yields a high-quality product. Some microprocessor self-test programs are discussed. Two case studies, the Therac-25 and the Ariane-5, are briefly outlined with emphasis on the causes of failure. A conclusion sums up the report.

1.0	Introduction
The safety of a system is its ability to operate without causing injury to people or damage to the environment. Critical systems are systems whose failure can cause economic loss, physical damage or threats to human life. Safety-critical software is any software that has the potential to cause a hazard, or that is required to support the control of a hazard [NASA.2001].

Accidents happen; that is part of life. But when mission-critical or safety-critical systems fail because of faulty software, serious questions are raised [Charles etal.2000].

The cost of failure is very high, so the trustworthiness of development techniques and processes is of paramount importance. Such systems are developed using well-tried techniques, including software engineering techniques that would not be cost-effective for ordinary systems.

Safety-critical software is either primary or secondary [Sommerville.2001]. In the former, the software is embedded as a controller in a system, and its malfunction can directly cause injury. In the latter, a malfunction may indirectly cause injury. This paper concentrates on primary safety-critical software.

Software can be made unsafe intentionally by intruders breaking into the system. While this does happen, the paper ignores this aspect and assumes it is best tackled as a security issue rather than a safety one.

Professor Knight, on his home page [Knight.1998], suggests that we should not trust the digitally controlled aircraft we fly in, for it is impossible to develop fully safe software to drive such systems. While this is true, this paper looks at the issues involved and at ways to increase customers' confidence in the use of safety-critical software.

From heart defibrillators to avionics suites, from nuclear plant controls to the anti-lock braking system in most cars, microprocessors have become an integral part of everyday systems upon which millions of lives depend. Their safe operation cannot be taken lightly because of the consequences of failure. The paper highlights some self-test programs that such microprocessors can perform.

2.0	 Why Reliable Software May Not Be Safe
Safety is affected by software specifications, hardware malfunctions and operator actions. A specification may be incomplete and may not describe the required behaviour of the system in some critical scenario. A hardware malfunction may present the software with a strange environment. An operator's sequence of actions may place unanticipated demands on the software, causing failures.

Reliability is a measure of the rate of failures that render a system unusable, and safety is a measure of the absence of unsafe software conditions [Patrick etal.1993]. Reliability and safety are different system concepts: the former describes how well the system performs its functions, the latter states that those functions do not lead to an accident. It is important to differentiate the two concepts, because mistaking reliability for safety can lead to accidents.

Safety assurance can be achieved by ensuring that accidents do not arise and, if they do, that their effects are minimised. Sommerville states three ways in which this can be achieved [Sommerville.2001]:

• Hazard avoidance.
• Hazard detection and removal.
• Damage limitation.

Complex systems are prone to accidents because there are many events that may go wrong. The use of software makes a system more complex and hence more accident-prone. However, according to Sommerville, if software is correctly used it can improve safety, because:

• It makes greater complexity possible.
• It is adaptable to different environments.
• It can provide sophisticated safety interlocks.
• It supports control strategies that reduce the amount of time people spend in hazardous environments.

I agree with him, and add that the main advantage of using digital computers in control strategies is their programmability: when a new control scenario arises, new software can be loaded while the hardware is retained intact. This reduces the time and cost required to provide a solution to the new situation. Physical systems can be modelled using differential equations; analogue computers are not effective for non-linear differential problems, while software-based systems are.
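To illustrate this last point, here is a minimal sketch (not drawn from any of the cited sources) of a software-based system integrating a non-linear differential equation, a pendulum with angular acceleration -(g/l)*sin(theta), using the forward Euler method. The constants and step size are assumptions chosen for the example.

    /* Forward Euler integration of the non-linear pendulum equation
       theta'' = -(g/l) * sin(theta). Illustrative values only. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        const double g = 9.81, l = 1.0;   /* gravity (m/s^2), pendulum length (m) */
        const double dt = 0.001;          /* integration step (s) */
        double theta = 1.0;               /* initial angle (rad) */
        double omega = 0.0;               /* initial angular velocity (rad/s) */

        for (double t = 0.0; t < 2.0; t += dt) {
            double alpha = -(g / l) * sin(theta);  /* non-linear term */
            omega += alpha * dt;
            theta += omega * dt;
        }
        printf("theta after 2 s: %f rad\n", theta);
        return 0;
    }

The same hardware could be reloaded with a different model, which is precisely the programmability advantage argued above.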

Mandates dictating which language or methods to use may also lead to software being unsafe [SEI.2001].

3.0	 Safety Specification
During the requirements engineering process, potential hazards that might arise should be analysed to assess the risks they pose. The analysis results in specifications that describe how the system should handle each hazard. The process of safety specification and assurance is part of the safety life cycle (fig. 1).

The life cycle includes:

• Definition of the scope of the system.
• Hazard and risk analysis.
• Safety requirements specification.
• The planning and development activity, which involves planning and implementation of the safety-critical system.
• Planning of safety validation, installation, operation and maintenance of the system.

Fig. 1: The IEC 61508 safety life cycle [Sommerville.2001]

Safety validation is also carried out before the system is put into use. Safety must be managed during the operation and maintenance of the system, so the system should be designed for maintainability. Safety considerations that may apply during decommissioning are also considered.

In a specification, the task the program is to accomplish is stated, but not how the program should be implemented. Any notation for writing formal program specifications should facilitate abstraction. Whether software will exhibit the intended behaviour depends on specification, implementation and verification. A conventional approach is to specify in English, program in a given language and verify by testing. It has been argued that such an approach is prone to all sorts of misunderstandings and errors, and can lead to incorrect software. The reasons for this are:

• The specification is wrong: ambiguous, vague, inconsistent and/or incomplete.
• The program is wrong: it does not do what the specifier intended.
• The verification is incomplete: the testing is not exhaustive.

If this three-phase development process is so fraught with potential pitfalls, what can be done to improve it? Specifying in a formal notation rather than English holds the promise of bringing the most immediate benefits [Patrick etal.1993]. Formal means written entirely in a language with an explicitly and precisely defined syntax and semantics; mathematics is an appropriate basis for such a language. The advantages of formal specifications are:

• They can be studied mathematically.
• They are easier to maintain than specifications written in natural language.
• They are extensible: the notation can be extended to include other features as the need arises.

This method brings in the respectability of mathematics as a precise notation with a long and well-established underlying theory, inspiring confidence in its rigour and expressive power. A further argument for formal specification is that it forces a detailed analysis of what is being specified. Considering that the cost of failure is massive, this should be the method of choice for safety-critical software.
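As an illustration (invented for this paper, not drawn from the cited sources), a formal pre/post-condition specification of a square-root routine might read as follows, where SQRT and the tolerance EPS are hypothetical names:

    SQRT(x : REAL) returns r : REAL
        pre:   x >= 0
        post:  r >= 0  and  abs(r*r - x) <= EPS

The specification states what the routine must achieve, for which inputs, and to what accuracy, while saying nothing about how the result is computed.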

3.1	Hazard and Risk Analyses

This involves the analysis of the system and its operational environment to detect the hazards that may arise, their root causes and the associated risks. Sommerville says that the process is difficult and complex, and requires lateral thinking and input from various experts. It should be undertaken by experienced engineers in conjunction with domain experts and professional safety advisers, and group working methods should be used.

I agree with him, for it is impossible for one person to be an expert in all areas: being an expert means being knowledgeable in a narrow area, and peripheral problems require the input of others. People with prior experience of building similar systems can point the software engineer towards possible hazards, so that he or she can focus on more productive activities instead of re-inventing the wheel.

The process is iterative, involving:

• Hazard identification.
• Risk analysis and hazard classification.
• Hazard decomposition.
• Risk reduction assessment.

For each identified hazard, analysis should be carried out to discover the conditions that might cause it. Hazard analysis techniques are either deductive or inductive: deductive techniques start with a hazard and work from it to possible system failures, while inductive techniques start with a proposed system failure and identify which hazards might arise. The analyses should include reviews, checklists, Petri-net analysis, formal logic and fault-tree analysis. This paper considers Petri-net analysis only; for the other methods see [Patrick etal.1993].

3.1.1	Petri-Nets

Petri-nets are a graphical technique that can be used to model and analyse safety-critical systems for such properties as reachability, recoverability, deadlock and fault tolerance. They allow the identification of the relationships between system components such as hardware and software, and of human interactions with, or effects on, both. Real-time Petri-net techniques also allow analysts to build dynamic models that incorporate timing information; in so doing, the sequencing and scheduling of system actions can be monitored and checked for states that could lead to unsafe conditions [NASA.1999]. The Petri-net modelling tool differs from most other analysis methods in that it clearly demonstrates the dynamic progression of state transitions. Petri-nets can also be translated into mathematical logic expressions that can be analysed by automated tools, and the extracted information can be reformed into analysis-assisting graphs and tables that are relatively easy to understand. The advantages of Petri-nets over other safety analysis techniques are that they:

• Can be used to derive timing requirements in real-time systems.
• Allow the user to describe the system using a graphical notation, freeing the analyst from the mathematical rigour required for complex systems.
• Can be applied through all phases of system development. Early use allows the detection of potential problems, resulting in changes at the early stages of development, where such changes are far easier and cheaper than at later stages.
• Can be applied to worst-case analysis and to determining the potential risks of timing failures.
• Make a systems approach possible, since hardware, software and human behaviour can be modelled using the same language.
• Can be used at various levels of abstraction.
• Provide a modelling language that can be used for both formal analysis and simulation.
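As a small worked example (invented for this paper, not drawn from the cited sources), the sketch below encodes a toy Petri net in which two processes share a resource guarded by a lock place, and exhaustively explores its reachable markings to check that the hazardous marking, both processes in their critical sections at once, can never be reached. The net, its place names and the bounds are all assumptions chosen for the example.

    /* Toy Petri net: places 0=idle1, 1=cs1, 2=idle2, 3=cs2, 4=lock.
       Transitions: enter1, exit1, enter2, exit2. The exhaustive search
       checks whether the hazardous marking (cs1 and cs2 both marked)
       is reachable from the initial marking. */
    #include <stdio.h>
    #include <string.h>

    #define NP 5          /* number of places */
    #define NT 4          /* number of transitions */
    #define MAXSTATES 64  /* bound on distinct markings explored */

    /* in_[t][p]: tokens consumed from place p; out_[t][p]: tokens produced. */
    static const int in_[NT][NP]  = {{1,0,0,0,1},{0,1,0,0,0},{0,0,1,0,1},{0,0,0,1,0}};
    static const int out_[NT][NP] = {{0,1,0,0,0},{1,0,0,0,1},{0,0,0,1,0},{0,0,1,0,1}};

    static int seen[MAXSTATES][NP], nseen;

    static int known(const int *m)
    {
        for (int i = 0; i < nseen; i++)
            if (memcmp(seen[i], m, sizeof(int) * NP) == 0) return 1;
        return 0;
    }

    static int hazardous(const int *m) { return m[1] > 0 && m[3] > 0; }

    static int explore(const int *m)
    {
        if (hazardous(m)) return 1;
        if (known(m) || nseen == MAXSTATES) return 0;
        memcpy(seen[nseen++], m, sizeof(int) * NP);
        for (int t = 0; t < NT; t++) {
            int next[NP], enabled = 1;
            for (int p = 0; p < NP; p++)
                if (m[p] < in_[t][p]) { enabled = 0; break; }
            if (!enabled) continue;
            for (int p = 0; p < NP; p++)
                next[p] = m[p] - in_[t][p] + out_[t][p];
            if (explore(next)) return 1;
        }
        return 0;
    }

    int main(void)
    {
        int m0[NP] = {1, 0, 1, 0, 1};  /* both processes idle, lock free */
        printf(explore(m0) ? "hazardous marking reachable\n"
                           : "hazardous marking unreachable\n");
        return 0;
    }

Because the lock place must be consumed to enter either critical section, the search reports the hazardous marking as unreachable; removing the lock from the net makes it reachable, which is exactly the kind of result a Petri-net safety analysis is after.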

Unfortunately, Petri-nets require a large amount of detailed analysis to model even relatively small systems, which makes them very expensive. To reduce this expense, a few alternative Petri-net modelling techniques have been proposed, each tailored to a specific type of safety analysis. For example, time Petri-nets take account of the time-dependency of real-time systems; inverse Petri-nets, designed specifically for safety analysis, use the backward modelling approach discussed below to avoid modelling all of the possible reachable states; and critical-state inverse Petri-nets further refine inverse Petri-net analysis by modelling only the reachable states at predefined criticality levels.

Petri-net analysis can be performed at any phase of the software development cycle, though for reasons of expense and complexity it is highly recommended that the process be started at the beginning of the cycle and expanded for each succeeding phase. Petri-nets, inverse Petri-nets and critical-state inverse Petri-nets are all relatively new technologies, are costly to implement, and absolutely require technical expertise on the part of the analyst.

3.1.2	Inverse Petri-Nets

One particular use of Petri-nets that aids the determination of software safety requirements is "inverse Petri-net" analysis. During the requirements phase of development, a Petri-net is at a high level of abstraction and is usually only descriptive in nature. Hazardous events identified by the preliminary hazard analysis are fed into the Petri-net, and the analyst then works backward through the net, to each state that could cause the hazard, until the hazardous state at the software interface is reached. The backward approach is practical only if one considers a relatively small number of high-risk states within the system. If the software interface does not handle the hazard in a manner that eliminates or reduces risk to an acceptable level, the software safety requirements at the interface need to be restated and re-implemented to offset the risk. The safety-augmented interface is then re-analysed for its impact on the rest of the system. The backward approach of inverse Petri-net analysis uses the proof technique known as "proof by contradiction", which is of particular importance because it greatly reduces the amount of work required to perform a safety analysis [NASA.1999]. Proof by contradiction proves that hazardous events do not occur, and states nothing else about the system.

3.2	Risk Assessment

This involves assessing each hazard, the probability that it will arise, and the probability that an accident will result from it. The outcome of the assessment is a statement of acceptability, which may be:

1. Intolerable: the hazard should never arise and, if it does, it should not cause an accident.
2. As low as reasonably practical (ALARP): the design should minimise the possibility of the hazard occurring.
3. Acceptable: the risk can be tolerated, so as not to increase the cost.

The estimation of hazard probability depends on engineering judgement. Severities are assigned in relative terms, for it is difficult to put a quantitative value on each. The implication is that novice engineers should not be assigned safety duties, for they have not yet acquired the necessary judgement.

3.3	Risk Reduction

Once the potential hazards and their causes have been identified, the system specification should be formulated so that these hazards are unlikely to result in accidents. The methods used are those outlined in section 2.0. Sommerville recommends using a combination of methods, and I concur, because redundancy has been found to increase the reliability of systems: if one method fails to stop a hazard from occurring, another may reduce the effects of the resulting accident.

4.0	Safety Assurance
Safety cannot be meaningfully specified in a quantitative way, and so cannot be measured when the system is tested. Safety validation therefore involves establishing a level of confidence in relative terms. This confidence could be based on the previous experience of the organisation developing the system: according to Sommerville, if a company has previously developed a number of control systems that have operated safely, it is reasonable to assume that it will continue to develop safe systems of this type.

This could be true, but a history of past success can breed complacency, with disastrous consequences. The fact that the National Aeronautics and Space Administration (NASA) had a record of successful launches did not prevent the loss of one of its spacecraft at launch. Sommerville says that this assessment must be backed up by tangible evidence from the system design, the results of system verification and validation, and the development processes used. I concur, as this makes the company aware that, beyond its reputation, the customer will be interested in the process involved.

4.1	Verification and Validation

The results of tests are used as evidence, together with reviews and static checking, to judge system safety. Sommerville suggests five reviews that should be mandatory for safety-critical systems. These check for:

• Correct intended function.
• A maintainable structure.
• Verification that the algorithm and data structure design is consistent with the specified behaviour.
• Consistency of the code with the algorithm and data structure designs.
• Adequacy of the system test cases.

Although it is not clear from Sommerville whether the reviews should be held sequentially or concurrently, I would propose that they be held sequentially. This allows total focus on one issue at a time and the possibility of maximum throughput from each review; any other defect noticed along the way can be recorded and addressed at the appropriate review.

4.2	Safety Arguments

There is a continuing debate in the critical-systems community about the role of formal methods in the development of safety-critical software. The use of formal mathematical specifications and associated verification is mandatory under United Kingdom defence standards for safety-critical software. Many developers are not convinced, however, and argue that formal methods may reduce rather than increase system dependability.

Proofs of program correctness are a recommended software verification technique. They are not normally used, because of the impracticality of constructing correctness proofs for whole programs; for safety-critical systems, however, they are a requirement [Sommerville.2001].

Sommerville suggests that where it is not cost-effective to develop a correctness proof, a safety argument may be developed instead. A safety argument demonstrates that the program meets its safety obligations: it is only necessary to demonstrate that program execution cannot result in an unsafe state, rather than to prove that the program meets its specification. The most effective technique for demonstrating the safety of a system is proof by contradiction: it is assumed that the unsafe state can be reached by program execution, and the code is then analysed to show that the pre-conditions for this hazardous state are contradicted by the post-conditions of all program paths leading to it.
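A minimal sketch of the kind of code over which such an argument can be made follows; it is invented for this paper, and the device, names and limit are all assumptions. The unsafe state is "administered dose exceeds MAX_DOSE"; since every path through compute_dose() ends with the dose clamped into range, the pre-condition of the unsafe state is contradicted on all paths.

    /* Safety-argument example: the hazard "dose > MAX_DOSE" cannot be
       reached, because every exit path establishes 0 <= dose <= MAX_DOSE. */
    #include <assert.h>
    #include <stdio.h>

    #define MAX_DOSE 25  /* assumed safe upper limit, in pump units */

    int compute_dose(int level, int previous)
    {
        int dose = 0;
        if (level > previous)                 /* rising reading */
            dose = (level - previous) / 4;    /* hypothetical dose law */
        /* Post-conditions of all paths: dose is clamped into [0, MAX_DOSE],
           contradicting the pre-condition of the unsafe state. */
        if (dose < 0) dose = 0;
        if (dose > MAX_DOSE) dose = MAX_DOSE;
        assert(dose >= 0 && dose <= MAX_DOSE);
        return dose;
    }

    int main(void)
    {
        printf("dose: %d\n", compute_dose(160, 40));  /* clamped to 25 */
        return 0;
    }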

Although this is not an ideal method, I believe it goes a long way towards improving the production of safety-critical software. For small companies, where the cost of software production is a prohibiting factor, using this method could allow the company to break even. I also suggest that using it together with formal methods would cut costs, although well-established customers may not accept it easily.

4.3	Process Assurance

This is highly important for safety-critical systems because:

• Accidents are rare in critical systems, and it may be practically impossible to simulate them during the testing of a system.
• Safety requirements are 'shall not' requirements that exclude unsafe system behaviour; it is impossible to conclude through testing and other validation activities alone that these requirements have been met.

During the development of a safety-critical system, explicit attention should therefore be paid to safety at all stages of the software process. The safety assurance activities that must be included in the process are [Sommerville.2001]:

• The creation of a hazard logging and monitoring system that traces hazards from preliminary hazard analysis through to testing and system validation.
• The appointment of project safety engineers who have explicit responsibility for the safety aspects of the system.
• The extensive use of safety reviews throughout the development process.
• The creation of a safety certification system whereby safety-critical components are formally certified for their assessed safety.
• The use of detailed configuration management.

These activities, if adhered to, should result in safe software. I believe that if the process followed is of high quality then the product will be of high quality. Redundant certification, that is, first-party, second-party and third-party certification, would boost users' confidence in a product.

There is a need to use certified software engineers, so that they can take responsibility for quality assurance. In other engineering disciplines the quality engineer is normally certified; this is not the case in software engineering. I suggest this is because software engineering is not yet a recognised profession: it is still at the craft stage. When it matures, and the demand for quality products increases, engineers will need certification.

4.4	Run-Time Checking

It is important that redundant code be added to check for violations of safety constraints and to raise an exception when one occurs. Each safety constraint should be expressed as an assertion, where an assertion is a predicate describing a condition that must hold before the following statement can be executed. The assertions are generated from the safety specification, and are intended to assure safe behaviour rather than behaviour that merely conforms to the specification.
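A minimal sketch of such a run-time check follows (invented for this paper; the actuator, names and limit are assumptions). The check enforces the assumed safety constraint "commanded speed never exceeds SAFE_LIMIT" and diverts to a fail-safe handler when it is violated, independently of whatever the functional code computed.

    /* Run-time safety check: redundant code guards the actuator command. */
    #include <stdio.h>
    #include <stdlib.h>

    #define SAFE_LIMIT 100  /* assumed maximum safe speed, in mm/s */

    static void fail_safe(const char *why)
    {
        /* A real system would drive the plant to a safe state here. */
        fprintf(stderr, "safety violation: %s\n", why);
        exit(EXIT_FAILURE);
    }

    static void command_speed(int speed)
    {
        /* Assertion derived from the safety specification, not the
           functional specification. */
        if (speed < 0 || speed > SAFE_LIMIT)
            fail_safe("commanded speed outside safe range");
        printf("actuator speed set to %d mm/s\n", speed);
    }

    int main(void)
    {
        command_speed(80);   /* accepted */
        command_speed(250);  /* trapped by the safety check */
        return 0;
    }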

5.0 Software Safety Paradox
Embedded systems running safety-critical applications face a quandary: how can the software know that it is operating correctly? Can a malfunctioning system diagnose itself, and either correct itself or halt? Can the system perform a sanity check on itself? The answer to these questions is yes, and this section outlines some of the methods [Dough.1998].

5.1	CPU Self-Test

The test can only be carried out if the CPU has not suffered a catastrophic failure. The self-test starts with simple instructions, verifying their correct operation, and then proceeds to more complex ones. The need to test individual instructions dictates that the CPU test be written in assembly code. Because branching is the cornerstone of any program, a simple unconditional branch is tested first: if the CPU cannot execute such an instruction correctly, it is unreliable. If the branch instruction fails, the microprocessor should fall into failure code. This segment of failure code should be simple, because if it is ever executed the CPU is already known to be unreliable.

As a minimum, a CPU self-test should exercise each instruction that may be used in the application: verify the processor's ability to set and reset the flags, verify the correct execution of mathematical operations, and finally check the addressability and read-write capability of each register. The register test checks for crosstalk between individual bits, bytes and words. CPU tests run at power-on, following hardware initialisation, and during background processing. The CPU test also eliminates the need to test each microprocessor before installation.
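The register test itself is necessarily written in assembler for the target processor; the sketch below (invented for this paper) only illustrates the walking-ones/walking-zeros pattern such tests typically use, applied here to a volatile test cell standing in for a register.

    /* Walking-bit pattern: a single 1 (then a single 0) is marched across
       all 32 bit positions to expose stuck or coupled bits. */
    #include <stdint.h>

    static volatile uint32_t test_cell;  /* stands in for a CPU register */

    int walking_ones_test(void)
    {
        for (int bit = 0; bit < 32; bit++) {
            uint32_t ones = (uint32_t)1 << bit;
            uint32_t zeros = (uint32_t)~ones;   /* a single 0 in a field of 1s */
            test_cell = ones;
            if (test_cell != ones) return 0;    /* stuck-at-0 or crosstalk */
            test_cell = zeros;
            if (test_cell != zeros) return 0;   /* stuck-at-1 or crosstalk */
        }
        return 1;  /* all bit positions verified */
    }

    int main(void) { return walking_ones_test() ? 0 : 1; }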

As a safety measure, all unused memory areas should be filled with halt instructions. A jump outside the program space then executes a halt instruction and triggers a trap. If the error occurs in a debugging environment, the trap can start a debugging function; if it occurs in a production environment, the trap can vector into a 'safe the system' function. For a heart defibrillator, the safe function would disable the charger and the transfer relays; for a robotic system, it should shut down all of the robot's motors.

5.2	ROM Tests

The test verifies the integrity of a program or data stored in ROM. It is a basic safety precaution that should be performed at power-on, before the main application executes; if the test fails, the main application should not be run. ROM tests use cyclic redundancy checks (CRCs) or checksums to guard against corrupted memory. A checksum is simple and fast, but multiple errors can cancel each other out; a CRC gives a better integrity check but is slower.
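A minimal sketch of the checksum variant follows (invented for this paper). The ROM region and the location of the reference value are assumptions; a real build would take both from the linker map, and a CRC would replace the additive sum where stronger detection is needed.

    /* Power-on ROM integrity check with a simple additive checksum. */
    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    static uint8_t  rom_image[4096];   /* stands in for the ROM contents */
    static uint16_t stored_checksum;   /* reference value fixed at build time */

    static uint16_t checksum(const uint8_t *start, size_t len)
    {
        uint16_t sum = 0;
        for (size_t i = 0; i < len; i++)
            sum = (uint16_t)(sum + start[i]);  /* wraps modulo 2^16 */
        return sum;
    }

    int main(void)
    {
        if (checksum(rom_image, sizeof rom_image) != stored_checksum) {
            puts("ROM corrupt: refusing to start main application");
            return 1;
        }
        puts("ROM ok: starting main application");
        return 0;
    }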

5.3	Watchdog Timers

A watchdog timer is a circuit or function that counts down from a pre-set value until it expires or is reset. The executing software resets the watchdog at pre-set intervals to prevent its expiration; if it fails to do so, the processor is reset.

These timers are effective in cyclic systems, where the watchdog is tickled each time through the loop. The non-determinism of a multitasking environment makes a watchdog more difficult to implement.
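The cyclic pattern looks roughly like the sketch below (invented for this paper; the application steps are stubs, and kick_watchdog() stands in for a device-specific register write).

    #include <stdio.h>

    /* Hypothetical application steps, stubbed so the sketch is self-contained. */
    static void read_sensors(void)    { }
    static void compute_outputs(void) { }
    static void drive_actuators(void) { }

    static void kick_watchdog(void)
    {
        /* On real hardware: write the magic key to the watchdog register. */
    }

    int main(void)
    {
        for (int cycle = 0; cycle < 3; cycle++) {  /* bounded here; endless in practice */
            read_sensors();
            compute_outputs();
            drive_actuators();
            kick_watchdog();  /* a hung loop skips this and forces a reset */
        }
        return 0;
    }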

5.4	Redundant Storage

The designer of safety-critical software must consider the threat that sometime, somewhere, a variable will be corrupted. If the corrupted variable is a critical parameter, the result could be disastrous. A commonly used technique is to keep three copies of each critical variable, stored in different memory areas and in different formats. Before a variable is used, the three copies are compared; if they do not all agree, two-of-three voting determines which value should be used. The disadvantages are the additional time required to access the copies and perform the integrity checks, and the sheer number of variables, which makes keeping copies of each of them difficult. Only variables critical to safe operation should therefore be duplicated in this way.
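A minimal sketch of the technique follows (invented for this paper). As a simple format variation, the third copy is stored bit-inverted; a real system would also place the copies in physically different memory areas.

    /* Two-of-three voting over redundant copies of a critical variable. */
    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        uint16_t a;      /* copy in memory area 1 */
        uint16_t b;      /* copy in memory area 2 */
        uint16_t c_inv;  /* copy in area 3, stored bit-inverted */
    } critical_u16;

    static void store(critical_u16 *v, uint16_t value)
    {
        v->a = value;
        v->b = value;
        v->c_inv = (uint16_t)~value;
    }

    /* Returns 1 and the voted value if at least two copies agree, else 0. */
    static int load(const critical_u16 *v, uint16_t *out)
    {
        uint16_t c = (uint16_t)~v->c_inv;
        if (v->a == v->b || v->a == c) { *out = v->a; return 1; }
        if (v->b == c)                 { *out = v->b; return 1; }
        return 0;  /* all three disagree: no trustworthy value */
    }

    int main(void)
    {
        critical_u16 v;
        uint16_t value;
        store(&v, 1234);
        v.a ^= 0x0400;  /* simulate corruption of one copy */
        if (load(&v, &value))
            printf("voted value: %u\n", (unsigned)value);  /* recovers 1234 */
        else
            printf("no two copies agree\n");
        return 0;
    }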

6.0	Case Studies
6.1	Therac-25

This was an infamous medical linear accelerator that massively overdosed six radiation-therapy patients over a two-year period [Charles etal.2000]. The accidents were attributed to: overconfidence in software; confusing reliability with safety; lack of defensive design; failure to eliminate root causes; complacency; unrealistic risk assessment; inadequate investigation; inadequate software engineering practice; unsafe software reuse; and inadequate user and government oversight of standards.

From the Therac-25 experience I suggest that eternal vigilance is the only way to produce safe safety-critical software.

6.2	Ariane-5

Ariane-5 was the newest in a family of rockets designed to carry satellites into orbit. On its maiden launch, on June 4, 1996, it self-destructed within 40 seconds, and its payload of four satellites was destroyed. The core problem in the Ariane-5 failure was incorrect software reuse: a critical piece of software had been reused from the Ariane-4 system, but behaved differently in Ariane-5 because of differences in the operational parameters of the two rockets. During a data conversion from a 64-bit value to a 16-bit value an overflow occurred, resulting in an operand error. This triggered a series of events that led to the activation of the rocket's self-destruct mechanism [Charles etal.2000].
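The failure mode is easy to reproduce in miniature. The sketch below (invented for this paper; the variable name merely echoes the horizontal-bias value involved in the accident) shows the unchecked 64-to-16-bit conversion replaced by a guarded one that reports the overflow instead of propagating a silently wrong value.

    /* Guarded narrowing conversion: reject values that cannot fit. */
    #include <stdint.h>
    #include <stdio.h>

    static int to_int16_checked(int64_t value, int16_t *out)
    {
        if (value < INT16_MIN || value > INT16_MAX)
            return 0;  /* out of range: caller must handle the exception */
        *out = (int16_t)value;
        return 1;
    }

    int main(void)
    {
        int64_t horizontal_bias = 40000;  /* illustrative value, too big for 16 bits */
        int16_t converted;

        if (!to_int16_checked(horizontal_bias, &converted))
            printf("conversion would overflow: %lld rejected\n",
                   (long long)horizontal_bias);
        else
            printf("converted: %d\n", converted);
        return 0;
    }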

Although software reuse is fashionable today, I suggest caution when reusing software on a different system: comprehensive testing of the total system should always be carried out.

7.0	Conclusion
Most modern systems have software embedded in their controllers, and failure of that software can have disastrous effects. The use of software makes a system more complex and hence more accident-prone; yet the complexity of today's systems means that effective control can only be achieved through software. How, then, do we make it safe?

Defensive programming would go a long way towards achieving this. Formal methods should be encouraged, and Sommerville's safety arguments can be used in conjunction with them. Since safety is not easily quantified or tested, the quality of the process must be high: if safety issues are highlighted at every stage and configuration management is enforced, then safe software can be realised. Certification of both the product and the personnel should be adhered to, and the need to make software engineering a profession cannot be over-emphasised; this will demand a certain level of training for professional recognition. If the resulting software is of high quality, the confidence of users will rise and the fear of digital systems will decrease.

Software reuse should be treated with caution. Mandates should be discouraged and, where this is not possible, the issuing authority should be involved in the product development. A subset of the programming language used, free of error-prone constructs, should be adopted.

8.0	References
[Sommerville.2001]	Ian Sommerville (2001). Software Engineering. London: Pearson Education.
[Patrick etal.1993]	Patrick R.H. Place & Kyo C. Kang (1993). Safety-Critical Software. http://www.sei/cmu.edu/techical/report Accessed 8/02/2002.
[Knight.1998]	Prof. Knight (1998). Safety in Nuclear Plant. http://www.cs.edu/bronchure/profs/knight.html Accessed 8/02/2002.
[NASA.2001]	Safety Standards (2001). http://www.swg.jpl.nasa.gov/resources/SWG-Safety.html Accessed 8/02/2002.
[SEI.2001]	Software Upgrade Workshop (2001). http://www.sei/cmu.edu/publication/featured/techical.html Accessed 8/02/2002.
[NASA.1999]	Safety Analysis Manual (1999). http://www.satc.gsfc.nasa.gov/assure/nss/1740.html Accessed 11/02/2002.
[Dough.1998]	Solving the Software Paradox (1998). mailto:ddbrown@physio-control.com Accessed 13/02/2002.
[Charles etal.2000]	Charles Knutson & Sam Carmichael (2000). Safety First: Avoiding Software Mishaps. http://www.cs.byu.edu/safety first avoiding software mishaps.htm Accessed 13/02/2002.