Common-Mode Failures in Redundant VLSI Systems: A Survey

Subhasish Mitra, Student Member, IEEE, Nirmal R. Saxena, Senior Member, IEEE, and Edward J. McCluskey, Life Fellow, IEEE

Abstract: This paper presents a survey of CMF (common-mode failures) in redundant systems, with emphasis on VLSI (very large scale integration) systems. The paper discusses CMF in redundant systems, their possible causes, and techniques to analyze the reliability of redundant systems in the presence of CMF. Current practice and recent results on the use of design diversity techniques for CMF are reviewed. By revisiting the CMF problem in the context of VLSI systems, this paper augments earlier surveys on CMF in nuclear and power-supply systems. The need for quantifiable metrics and effective models for CMF in VLSI systems is re-emphasized. These metrics and models are extremely useful in designing reliable systems. For example, using these metrics and models, system designers and synthesis tools can incorporate diversity in redundant systems to maximize protection against CMF.

Index Terms: Common-mode failures, concurrent error detection, data integrity, design diversity, redundancy.

Manuscript received November 1, 1999; revised April 1, 2000. This work was supported by the U.S. Defense Advanced Research Projects Agency (DARPA) under Contract DABT63-97-C-0024 (Dependable Adaptive Computing Systems [ROAR] project). The authors are with the Center for Reliable Computing, Stanford University, Stanford, CA 94305 USA (e-mail: {smitra; saxena; ejm}@crc.stanford.edu). Publisher Item Identifier S 0018-9529(00)11753-1.

ACRONYMS (the singular and plural of an acronym are always spelled the same)
ALU   arithmetic-logic unit
ASIC  application-specific integrated circuit
CAD   computer-aided design
CASE  computer-aided software engineering
CCF   common-cause failure
CED   concurrent error detection
CMF   common-mode failure
EMI   electromagnetic interference
FPGA  field-programmable gate array
HDL   hardware description language
IC    integrated circuit
s-    implies the statistical definition
TMR   triple modular redundancy
VLSI  very large scale integration

I. INTRODUCTION

Redundancy techniques are widely used for enhancing system reliability, availability, and data integrity. Redundancy can be either temporal or physical. In temporal redundancy, the same task is repeated multiple times and the final result is calculated using the individual results obtained from all the runs. For systems with physical redundancy, a module is replicated and the results from the individual implementations are used to calculate the final result. Duplication in the form of self-checking pairs (duplex systems) and TMR are classical examples of redundancy techniques. There is a large literature on redundancy techniques and on reliability modeling of systems with redundancy [43], [48], [50].

In a redundant system, CMF result from failures that affect more than one module at the same time, generally due to a common cause [30]. CMF can appear due to external causes (such as EMI, power-supply disturbances, and radiation) or internal causes. Design mistakes also constitute an important source of CMF. As stated in [4], although the use of redundant copies of hardware has proven to be quite effective in the detection of physical faults and subsequent system recovery, design faults are reproduced when redundant copies are made; thus, simple replication fails to enhance the fault tolerance of the system with respect to design faults.
Common-mode (common-cause) failures have been discussed extensively in publications related to the safety and reliability of nuclear reactors and power-supply systems. It is well known that CMF make the classical reliability expressions for redundant systems optimistic. As observed in [30], the addition of redundant modules is not a solution to CMF. The importance of these failures can be understood from the observation in [16]: "The system unavailability may be increased by more than a factor of 10 in varying common cause contribution from zero to 1%" [sic]. However, most of the publications related to this subject consider nuclear reactors and power-supply systems. Very few of them consider CMF in dependable computing systems designed using redundancy techniques. This paper surveys the work on CMF in redundant systems in general, with special emphasis on computing systems.

A natural component of the study of CMF is the study of diversity. As early as 1970, diversity was identified as an effective antidote for CMF [25]. However, the major thrust was on having diversity in methodologies at various steps of the design of a nuclear reactor. Design diversity was proposed in [4] to protect redundant computing systems against CMF. This paper reviews both prior art and recent results on design diversity.

The main contributions of this paper are:
1) surveying research on CMF;
2) bringing into perspective the issues related to these failures in digital IC systems;
3) presenting results from recent publications that help understand design diversity in IC systems;
4) addressing the issue of safety in redundant systems in the presence of CMF.

Section II introduces the basic notion of CMF, including the causes and a classification of these kinds of failures. Section III examines techniques to perform reliability analysis. Section IV discusses methodologies to handle these failures at various stages of the design process. Section V discusses metrics to quantify design diversity and techniques to design diverse redundant systems.

II. COMMON-MODE FAILURES

A common-mode failure (CMF) is "the result of an event(s) which, because of dependencies, causes a coincidence of failure states of components in two or more separate channels of a redundancy system, leading to the defined system failing to perform its intended function" [53]. These types of failures have also been referred to as CCF [38]. The advantage of using the definition of [53] is that it defines CMF as those failures generated by a single source rather than those having exactly identical effects in the design. Thus, this definition holds for redundant systems where the copies are s-identical or different. For systems with s-identical copies, the CMF effect can be the same for both copies; for nonidentical copies, the effects can be different. This paper uses this definition of CMF and calls these failures common-mode or common-cause failures interchangeably.

Fig. 1. Example duplex system.

The CMF problem is explained using the duplex system in Fig. 1. In the duplex system of Fig. 1, there are two copies of the same unit (Copy 1 and Copy 2). The outputs of the two units are compared. Any mismatch at the outputs of the two copies prompts a corrective action (maintenance, replacement with standby spares, etc.).
If the failures in the units are s-independent, then the addition of simple redundancy like duplication reduces the system failure rate. However, for CMF, mere addition of redundancy might not help reduce the system failure rate. The following example illustrates this.

Example: Let the probability that any one of the 2 copies fails be $p$. If the failures are s-independent, the probability that the system fails (classical analysis assumes that the system fails when both copies fail) is $p^2$. However, for CMF, due to a single cause, both copies can fail; if the probability of that cause of failure is $q$, then the probability that both copies fail is $q$ rather than $q^2$. Thus, simple addition of redundancy through replication does not help protect the system against these CMF. This example implicitly assumes that, since the redundant copies are s-identical, the CMF effects are also s-identical. This motivates using diverse copies of the different modules in a redundant system. With diverse copies, it is possible that the error effects of a particular CMF are different for each copy. Thus, there is a possibility of detecting the CMF and taking corrective action. Use of design diversity to protect redundant systems against CMF is discussed in more detail in Section V.
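As a rough numerical illustration of this example (ours, not from the original text), the short Python sketch below compares the duplex failure probability under the s-independence assumption with the probability when a single common cause of assumed probability $q$ can fail both copies; all numbers are illustrative assumptions.

```python
def duplex_failure_prob(p, q):
    """Probability that both copies of a duplex system fail.
    p: probability that a single copy fails from its own (s-independent) causes
    q: probability of a single common cause that fails both copies
    The two failure sources are assumed s-independent of each other."""
    return q + (1 - q) * p ** 2

p = 1e-3   # assumed per-copy failure probability
q = 1e-5   # assumed common-cause probability (rarer than a single-copy failure)
print(f"independent-failures-only analysis: {p**2:.1e}")                       # 1.0e-06
print(f"with the common-cause term:         {duplex_failure_prob(p, q):.1e}")  # ~1.1e-05
```

Even though the common cause is assumed to be much rarer than a single-copy failure, it dominates the duplex failure probability, which is exactly the effect described above.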
A. Causes of CMF

Extensive studies related to causes of CMF in nuclear reactors, power-supply systems, avionics, etc., have been conducted. In contrast, there are very few publications on CMF in VLSI systems. This paper focuses on causes of CMF in VLSI systems.

Design faults constitute a major portion of CMF in replicated systems [4]; they can occur in the hardware or in the software. Design faults can be human-made (or due to the presence of bugs in the tools used, incorrect or incomplete or imprecise specifications, incorrect understanding, etc.) and are mainly introduced during the phase of creation of redundant systems [30]. Design faults can be permanent (hardware or software bugs) or intermittent (e.g., weak signals).

CMF can also occur due to external disturbances when the system is operating. These kinds of disturbances include fluctuations in the power supply and radiation; the fault effects can be transient or permanent. In [9], power-supply disturbances were analyzed and it was shown that dips in the power-supply voltage cause delay faults in the circuits; these effects are transient. On the other hand, some literature claims that a single radiation source can cause multiple-event upsets in logic circuits and memories [46]. If the memory locations are written frequently, then these upsets have a temporary effect on the system. However, in SRAM-based programmable systems (e.g., field-programmable gate arrays, FPGA), upsets from radiation can have a permanent effect (unless the FPGA configuration is loaded again). In addition to these, CMF in information systems have been studied [10]; these CMF events are viruses (affecting applications, compilers, operating systems), power-supply disturbances, and manufacturer defects affecting compilers, OS, monitors, disk drives, RAM, power supplies, etc.

B. Classification of CMF

The literature on CMF has proposed various classifications of CMF [25], [30], [53]. Reference [25] classifies CMF into 4 groups: functional deficiency, maintenance error, design deficiency, and external event. Functional deficiency is not a hardware failure, but is a misapplication of hardware or an inability to predict the true nature of the system under consideration. A maintenance error is defined as consistent mis-calibration or mis-service of all instruments monitoring a given system. Design deficiency is an unrecognized dependence on a single, common element or a common deficiency in all elements of a particular type. External events are failures resulting from disturbances in the external environment.

The classification in [53] is somewhat similar to that in [25] and is mainly based on possible causes of CMF. The causes are split into two categories: engineering and operations. Design errors, incomplete design specifications, inadequate instrumentation, lack of standards, inadequate testing, etc., are in the engineering category. CMF arising from imperfect repair, operator error, environmental conditions, etc., are in the operations class. This kind of classification is useful for identifying the causes of CMF. Although these classifications are related to nuclear reactors and power-supply systems, they extend fairly well to IC systems, as described next.

Fig. 2. IC development flow and CMF.

Fig. 2 shows the flow of the IC development process: 1) there is a specification of the anticipated functionality of the IC that should be designed; 2) given the specification, the designer designs the IC; 3) the developed design is fabricated on silicon; 4) the fabricated parts are tested and shipped to the customer. CMF can be generated at different points of IC development; potential CMF causes at different levels must be handled in different ways. This section briefly addresses these causes (see Fig. 2).

A potential source of CMF is that specifications are often ambiguous or incomplete. Even with correct specifications, CMF can be generated during the design phase. This includes bugs in the CAD tools that are used for designing the IC chips, and incorrect interpretation and human errors (unintentional, or intentional due to sabotage) incurred during the design process. Incomplete design verification is a potential source of CMF at this stage. In the fabrication stage, CMF are introduced due to inaccuracies in the manufacturing process, leading to manufacturing defects. Some of the defective chips can be screened by thoroughly testing the manufactured parts. However, due to inadequate testing and low fault coverage, some defective and weak chips (that can cause early-life failures) can escape. In the field, radiation, EMI, and power-supply disturbances can cause CMF.

This kind of CMF classification identifies the possible sources and the steps needed to eliminate CMF. In an IC development project, depending on experience and past data, each stage can be improved so that the introduction or occurrence of CMF is minimized. However, there is a cost associated with each of the CMF elimination steps. The challenge is to apply CMF elimination techniques cost-effectively.

CMF can also be classified based on other properties. For example, it might be of concern whether a particular CMF affects the system only temporarily or permanently [30]. Reference [30] classifies CMF according to their nature, origin, and persistence. The nature of the CMF can be accidental or intentional. The origin of CMF can be some adverse physical failure, a human-made cause (e.g., operator errors), or an imperfect specification (e.g., design errors).
The CMF can persist temporarily or have a permanent effect on the system. Thus [30], any CMF belongs to 1 of the categories: transient external; permanent external; intermittent design fault; permanent design fault; or arising from interaction of the system with the external environment (e.g., operator errors).

CMF can be classified according to their effects on the system: catastrophic (life-critical support, fly-by-wire), noncatastrophic (e.g., on-line transaction processing, where the data can be resent), and negligible. CMF can also be classified as:
1) CMF that do not affect the system output; the system output is always correct.
2) CMF that can be detected in a redundant system through disagreement of multiple modules; however, it is not guaranteed that the system will always produce correct outputs.
3) CMF that have s-identical effects on different modules of a redundant system; thus, these CMF are not detectable through comparison of the outputs of the different modules.
4) CMF in the presence of which the duplex system either produces correct output or detects disagreement, prompting a repair or corrective action.
For safety-critical systems, cases 2 and 3 are not desirable. For other systems, where tests can be applied periodically, CMF in case 2 might be acceptable. However, CMF of case 3 are not desirable in any system.

III. RELIABILITY ANALYSIS WITH CMF

This section surveys techniques for estimating the reliability of redundant systems in the presence of CMF. Research in this area has focused on developing models for reliability analysis. Techniques for analysis of fault trees with CMF are in [15], [37], [44]. A summary of techniques is in [54]. Reference [38] advocates explicit inclusion of different events for each component in a common-cause group that fails all the members of the group. This is illustrated using the fault tree for a simple TMR system.

Fig. 3. Fault-tree analysis.

Consider a TMR system with three modules $A$, $B$, and $C$. To consider common-cause failures involving component $A$, [38] modifies the fault tree of component $A$, as shown in Fig. 3. Thus:
$A_T$ = total failure of component $A$,
$A_I$ = failure of component $A$ from s-independent causes,
$C_{AB}$ = failure of components $A$ and $B$ (but not $C$) from common causes,
$C_{AC}$ = failure of components $A$ and $C$ (but not $B$) from common causes,
$C_{ABC}$ = failure of components $A$, $B$, and $C$ from common causes.
The basic events that cause system failure (failure of at least 2 of the 3 modules) are then the s-independent pairs $\{A_I, B_I\}$, $\{A_I, C_I\}$, $\{B_I, C_I\}$ and the common-cause events $C_{AB}$, $C_{AC}$, $C_{BC}$, and $C_{ABC}$. Different probability values have to be assigned to these individual events to estimate the probability of failure of the TMR system. This is the Basic Parameter Model. Other models (Beta-Factor model, Multiple Greek Letter model, etc.) are explained in [38]. All these models try to address the issue of incompleteness of the data used to estimate the individual failure probabilities.
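To make the role of these basic events concrete, the following is a worked expression in our own notation (symmetric modules, first-order terms only; it is the usual basic-parameter bookkeeping, not an equation reproduced from [38]). Writing $Q_1 = P(A_I)$, $Q_2 = P(C_{AB})$, and $Q_3 = P(C_{ABC})$, the TMR failure probability is approximately
$$ P(\text{TMR fails}) \approx 3\,Q_1^2 + 3\,Q_2 + Q_3 . $$
Even small values of $Q_2$ or $Q_3$ can dominate the $3Q_1^2$ term when $Q_1$ is small, which is why the data-estimation issue mentioned above is central.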
Beta-Factor Model: A preliminary reliability analysis of redundant systems, considering CMF, can be obtained from [11]. Reference [11] assumes that the failure rate of a simplex system is $\lambda = \lambda_I + \lambda_C$, where $\lambda_I$ is the failure rate of s-independent failures and $\lambda_C$ is the failure rate of CMF. Reference [11] assumes that the fraction $\beta = \lambda_C / \lambda$ is determined from experimental data or experience. With this background, a parallel system of $n$ redundant modules is modeled.

Fig. 4. Reliability modeling of a parallel system with CMF.

In a parallel system, the system performs the intended operation as long as at least one of the redundant modules is fault-free. In Fig. 4, $R_I$ is the reliability of each individual module with respect to s-independent failures, and $R_C$ is the reliability with respect to CMF. This analysis assumes that, in the presence of a CMF, the system cannot perform its intended function. Thus, the system works correctly (performs its intended function) when: a) there is no CMF affecting the system, and b) at least 1 of the redundant modules works correctly in the presence of s-independent failures. Thus, the system reliability is
$$ R_{\mathrm{sys}} = R_C \left[ 1 - (1 - R_I)^n \right]. $$
When $\beta = 0$, then $R_C = 1$ and the system acts as an ordinary parallel system without any CMF. When $\beta = 1$, the system acts as a simplex system. For this model, the real problem is in estimating $\beta$, which can be extremely difficult. While this model might seem reasonable for replicated systems, it is not clear how effective it is for redundant systems with nonidentical copies. This modeling technique has been extended to calculate the reliability of other systems in the presence of CMF.
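The following Python sketch evaluates the expression above numerically, assuming exponentially distributed failure times so that $R_I = e^{-(1-\beta)\lambda t}$ and $R_C = e^{-\beta \lambda t}$; the rate, mission time, and $\beta$ values are illustrative assumptions, not data from the paper.

```python
import math

def beta_factor_parallel_unreliability(lam, beta, n, t):
    """1 - R_sys for an n-module parallel system under the beta-factor model (sketch).
    lam: total failure rate of one module (per hour); beta: common-cause fraction;
    t: mission time (hours). Assumes exponentially distributed failure times."""
    r_ind = math.exp(-(1.0 - beta) * lam * t)   # per-module reliability, s-independent failures
    r_cmf = math.exp(-beta * lam * t)           # reliability w.r.t. common-cause failures
    return 1.0 - r_cmf * (1.0 - (1.0 - r_ind) ** n)

lam, t, n = 1e-4, 100, 2          # illustrative values
for beta in (0.0, 0.01, 0.1, 1.0):
    print(f"beta = {beta:<4}: 1 - R_sys = {beta_factor_parallel_unreliability(lam, beta, n, t):.2e}")
```

Even a small $\beta$ visibly degrades the redundant system, and at $\beta = 1$ the model collapses to the simplex case, as stated above.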
The beta-factor model assumes that the system becomes nonfunctional (or does not function correctly) once failures occur. This is not necessarily true for digital systems. For these systems, a CMF can have several error effects in the copies (especially if the implementations are different), and the errors at the outputs might not always occur simultaneously in the same cycle. For example, in a duplex system, the failure effect can be such that the 2 modules do not produce the same set of erroneous outputs simultaneously; it is then guaranteed that the system either produces correct outputs or indicates an error. However, if the 2 modules are s-identical, then chances are high that the CMF has the same effect in both modules. As an example, consider a CMF design fault: if both copies are exact replicas of each other, they are affected in the same way. In this case, both copies produce the same set of erroneous outputs simultaneously; thus, there is no way the presence of the error can be detected. On the other hand, there can be CMF that cause the system to behave in such a way that there exists at least one set of inputs for which the failure is detected (from the error signal produced by the duplex system). The reliability modeling technique in [35] addresses these problems. It allows treating redundant systems (with s-identical or nonidentical copies) in a uniform way and derives simple relationships among mission times, failure rates, and the characteristics of the response of individual modules to failures. Section V describes the modeling technique.

Other papers on reliability analysis in the presence of CMF include [12]-[14], [20], [22], [28], [51], [52]. Reference [40] states that most of the models have very little relationship with the possible CMF causes. The binomial failure-rate model proposes a mechanism in which agents, called shocks, impact all components in the group. In [51], [52], the CMF have been modeled as external shocks having constant occurrence rates; in particular, a multivariate exponential model has been assumed to perform the reliability analysis in the presence of CMF.

Reference [20] gives a trinomial failure-rate model for reliability analysis in the presence of CMF. In this model, any system component is in 1 of 3 states:
1) success: the component is working correctly;
2) failed: the component has failed;
3) gray: it is too ambiguous to be declared a failed or a success state, such as a partial failure, potential failure, or incipient failure.
Other models for correlated failures and CMF are discussed in [28].

Although many models have been proposed for reliability analysis of CMF, for IC and computing systems there is not enough real experimental data to demonstrate their effectiveness. There has been some initiative in this direction for analyzing CMF in nuclear reactors. Thus, real experiments and CMF models are necessary for progress in research on protection of redundant IC systems against CMF.

IV. TECHNIQUES TO HANDLE CMF

As mentioned in Section II, CMF are considered a potential source of problems in redundant systems: nuclear reactors, power-supply systems, avionics, redundant VLSI and computer systems (hardware and software), etc. There are many approaches for handling CMF in redundant systems [30]. These approaches include CMF avoidance, CMF removal, and CMF tolerance. CMF avoidance techniques are applied during the 3 phases: 1) specification, 2) design, 3) implementation. The CMF-removal techniques are applied mainly during the test and validation phases, while the CMF-tolerance techniques are primarily intended to handle CMF while the system is in operation. This section explains each of these techniques in detail. It is not yet known whether these techniques cover all possible CMF sources exhaustively. The relative coverage of these individual techniques is a subject of further research, and thorough experimental data are needed to estimate the coverage numbers.

Reference [30] argues that the occurrence probability of CMF that are not covered in any of these three stages might be of the same order as the probability of multiple random faults in the multiple modules. Reference [30] uses the following argument to quantify this claim. Consider a pessimistic CMF arrival rate of $\lambda$ per hour. If a 99% CMF coverage is obtained at each of the 3 phases (specification, design, implementation), then the probability that a CMF is not detected by any of these techniques is $(1 - 0.99)^3 = 10^{-6}$. Thus, the probability that a system provides faulty outputs in the presence of a CMF is of the order of $10^{-6}\lambda$ per hour; this number is obtained by multiplying the probability of CMF occurrence and the probability that the CMF is not detected in any of the 3 phases. This shows orders of magnitude improvement in protection against CMF.

A. CMF Avoidance

Steps to avoid CMF must be adopted from the very beginning of the design and development processes because, during all project stages, CMF (or situations that can lead to CMF in the future) can be introduced. Later in the project, it might be difficult to detect CMF (or situations that can lead to CMF) introduced at the earlier stages. The main aim of CMF avoidance techniques is to reduce the number of permanent and intermittent design CMF introduced in computer systems. Seven CMF avoidance techniques are listed here; many of them also appear in [30].

1) Mature and Verified Components: The importance of reuse for building redundant computing systems is stressed in [30]. By using components which have been verified formally (microprocessors, operating system kernels, etc.)
and stable products that have been extensively tested (and verified), the probability of design flaws can be reduced. There is a very high chance that design flaws are introduced if one begins designing everything from scratch.

2) Conformance to Standards: While standards are mainly meant to ease interoperability of various techniques, logistics, maintainability, etc., they can also reduce design errors. This is because design errors are often introduced due to incomplete, ambiguous, and/or incorrect understanding of how different systems operate and interact with other systems. Conformance to standards reduces the probability of design errors arising from ambiguous interpretation of system operation.

3) Use of Formal Methods: Formal methods can be used for specifying, developing, and verifying computer systems with strong emphasis on consistency, completeness, and correctness of the properties. Formal specification and verification techniques have been used for such systems. However, the major concern regarding these formal methods is that they do not scale proportionately with increasing complexities in the design (e.g., size) and often explode (in time and memory space) for large designs.

4) Design Automation to Eliminate Human Errors: Reference [30] advocates the use of design-automation tools to automate parts of the hardware and software design cycles. With automated tools, the probability of human errors can be reduced. CASE tools and hardware design-automation tools can be used for this purpose. However, complete design automation can lead to design bugs that might be overlooked. In this context, [30] discusses the design methodology, integrated with formal methods, adopted at the Draper Laboratory.

5) Performance CMF: Performance CMF arise mostly in real-time systems. In the presence of these failures, the system fails to deliver the required services on time under various workload conditions. To avoid these kinds of performance CMF, it is important to develop an accurate, complete model of the system; with such a model, analysis (through simulation of benchmarks and numerical calculations) should be performed a priori, to find out whether timing faults can appear in the system under various conditions.

6) Design Rules and Design Techniques: Design rules are important in the design of VLSI systems, and are mainly guided by the capabilities (precision) of the underlying fabrication process, signal integrity, electromigration problems, etc. Design rules for reducing the chances of CMF (e.g., increasing the spacing between two signal lines) can possibly be devised. However, the resulting design rules can be too conservative and might not be suitable for achieving high signal speeds in high-performance VLSI systems. Design techniques, like shielding and radiation hardening, can be used to avoid failures caused by the external environment when a system is used in the field. The main drawbacks are that these techniques add appreciable extra development and manufacturing cost and increase the development time.

7) Design Diversity: Design diversity is an avoidance technique as well as a tolerance technique for CMF. It is an avoidance technique for design faults, and a tolerance technique for other kinds of faults.
The concept behind design diversity is to implement the various copies in a redundant system in different ways, starting from a common set of specifications. It applies at all levels: hardware, software, programming language, design development environment, etc. This approach can eliminate many common-mode design faults since each redundant copy uses a different design. However, incorrect interpretation of ambiguous specifications can still lead to faults in multiple copies; thus, design diversity cannot provide 100% coverage of all design faults [30]. From the viewpoint of design faults, mature verification techniques can be more useful in avoiding CMF arising from design faults. However, diversity might help in tolerating some other kinds of CMF in the field. With appropriate diversity, modeled failures in the field can have different effects on the different copies.

Design diversity has some costs associated with it. By definition, a given module must be designed at least twice to achieve diversity; thus, the extra development time is an extra cost. In addition, the two designs must be manufactured, and this increases the manufacturing cost. For example, one might have to manufacture two different ASIC for diversity. However [35], with reconfigurable computing systems, the costs associated with diversity can be reduced. For implementing diversity on reconfigurable hardware (like field-programmable gate arrays), one synthesizes and downloads different configurations. Thus, there is no need to manufacture two different ASIC. The design cost can be reduced with the use of CAD tools for synthesis, placement, and routing of designs on FPGA. Thus, the paradigm of reconfigurable computing can be regarded as an enabling technology for design diversity. Design diversity is explained in more detail in Section V.

B. CMF Removal

The CMF avoidance techniques in Section IV-A are not fool-proof. Thus, the faults that slip past the design process must be detected and removed in the later stages of system development. CMF removal techniques include design reviews, extensive simulation/verification, testing, and fault injection. Design reviews, simulation, and verification are mainly meant for removing design faults, which constitute a major fraction of CMF. Testing, on the other hand, detects mainly manufacturing defects and weak chips. Testing and fault injection are discussed in the following paragraphs.

1) Testing: Testing is performed mainly to screen out chips with manufacturing defects. There is a large literature on testing techniques [1], [39]. To ensure high quality, chips are required to work properly not only at the time of production, but also throughout the anticipated lifetime. Hence, screening techniques are also aimed at identifying weak parts. These are the chips that work correctly just after being manufactured, but have some latent defects; as a result, these parts can fail in the field as early-life failures. There are many ways to identify these weak parts. For VLSI systems, burn-in is common practice to ensure high reliability of the chips that are shipped to the customers. By exercising the chips at high temperature and/or high supply voltage, burn-in screens out chips with defects that can cause early-life failures and reliability problems [23]. However, burn-in is very expensive, and hence finding alternatives to burn-in is a very important research problem.
Some of these techniques are Iddq testing [18], VLV (very low voltage) testing [21], and SHOVE (short voltage elevation) testing [6], which can detect certain classes of defects that cause reliability failures.

2) Fault Injection: Fault injection inserts faults in an otherwise fault-free system (designed to tolerate faults) to evaluate the system's ability to tolerate these kinds of faults in the real environment. Faults can be injected either in a system prototype or in the software simulation model of the system. Fault injection enables studying the behavior of a redundant system in the actual environment. For fault injection to be a success, thorough studies of all failure mechanisms and modes that occur in real life (through experiments or actual field data) are needed. As mentioned in [30], fault injection techniques can also be used to operate the system in various degraded modes which the system can encounter in real life.

There are several ways to perform fault injection [8], [24]. Fault injection can be performed in software or in hardware. For software fault injection, errors are injected into the HDL model of the system, and simulation can be performed to see the response of the system to the errors introduced. The errors introduced can be in the function of the module or in the netlist of a particular module (if such a netlist is available). Thus, it is possible to simulate the system-level manifestation of gate-level faults. DEPEND [17] is an example of such an integrated design and fault injection environment. For hardware fault injection, the system (or a prototype) is first built and errors are introduced in the hardware. These can include disturbance of signals on the pins of the circuit, putting the chips under heavy-ion radiation [19], [27], [32], or power-supply disturbances [9].

For validation and verification, emulators are often used [45]. An emulator contains programmable elements. One way of implementing these programmable elements is to use FPGA. An HDL model of a given system is mapped to the FPGA in the emulator, and the emulator can be connected to the environment in which the system will run in real life. The advantage of emulators is the emulation speed (several orders of magnitude improvement over simulation time). Emulators can also be used to inject faults in the programmable elements (change the configuration of the FPGA) and evaluate the effect of the fault on the system.

C. CMF Tolerance

CMF can manifest themselves as transient or permanent faults due to external causes like environmental disturbances, power-supply disturbances, and radiation. These failures occur in the field, and the only way to handle them is to detect them in the field and take corrective actions once the failures are detected. This is why CMF tolerance and recovery are very important. For CMF detection, one can use watchdog timers, exception handlers, run-time checks, and presence tests [30]. Concurrent error detection can also be used to make systems secure against CMF.

Fig. 5. A duplex system with CED.

For example, consider the duplex system in Fig. 5; a CED circuit (CED1 or CED2) is associated with each module. If a CMF affects the two modules (Copy 1 and Copy 2), then the individual concurrent error detectors might be able to detect it and report an error. If a common-mode failure affects a particular module and its concurrent error detection circuit (Copy 1 and CED1, for example), then the comparator circuit can detect it.
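A minimal behavioral sketch of the Fig. 5 arrangement is given below; parity is used only as a stand-in CED code (the survey does not prescribe a particular CED scheme), and the values and names are illustrative assumptions.

```python
# Behavioral sketch of a duplex system with per-copy concurrent error detection (CED).
def parity(x: int) -> int:
    return bin(x).count("1") & 1

def duplex_with_ced(out1, pred_par1, out2, pred_par2):
    """True if an error is flagged by either CED circuit or by the comparator."""
    ced1_error = parity(out1) != pred_par1   # CED1 checks Copy 1 against its predicted parity
    ced2_error = parity(out2) != pred_par2   # CED2 checks Copy 2
    mismatch = out1 != out2                  # comparator between the two copies
    return ced1_error or ced2_error or mismatch

golden = 0b1010                  # assumed correct output of a hypothetical module
corrupted = 0b1011               # a CMF assumed to corrupt both copies identically
print(duplex_with_ced(corrupted, parity(golden), corrupted, parity(golden)))  # True
```

The point mirrors the text: an identical corruption of both copies defeats the comparator, but a per-copy CED check can still flag it.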
Design diversity can be used to detect CMF in redundant systems. For example, consider a duplex system with 2 s-identical copies of the same hardware. A common-mode failure that affects the s-identical leads of both copies will never be detected. However, with diversity it is possible to detect many CMF in multiple copies. The use of a hardware implementation and its dual is recommended in [49]. Reference [35] shows, by theory and simulation, that for CMF there is a distinct advantage in using diversity for detection purposes. Reference [36] presents techniques to synthesize redundant systems that detect modeled CMF. These results are discussed in Section V-B.

Recovery from CMF is tied very closely to CMF tolerance. Once a CMF is detected, it is necessary to restore the system state to a previously known correct point from which further computation can resume. This translates to deciding checkpoint intervals at which the system state is saved in nonvolatile memory, e.g., hard disks, for which the RAID (Redundant Array of Inexpensive Disks) architecture [42] can be used. When the error is detected, the system can be rolled back to the checkpointed state. Checkpointing and recovery are inter-related. Checkpointing and recovery schemes are described in [43].

V. DESIGN DIVERSITY

In design diversity, the hardware and software elements that are used for multiple computations (in a redundant system) are not just replicated, but are s-independently generated to meet the system requirements [4]. The basic idea is that, with redundant systems, it is possible to tolerate s-independent physical faults. However, to tolerate design faults, direct replication of the copies in a redundant system is of no help; the same design fault is reproduced in all the copies, and hence the system fails in response to an input that invokes this design fault. However, if the designs are generated s-independently (e.g., by different designers and design tools), chances are low that the exact same design fault appears in all the copies. Reference [4] gives 3 conditions for the s-independence of design faults.
1) Different algorithms, programming languages, translators, design automation tools, machine languages, etc., should be used.
2) s-Independent programmers or designers, with diversity in their training and experience, should be used.
3) The most critical condition is the existence of a complete, correct initial statement of requirements that should be satisfied by all the diverse designs. The use of formal methods of requirements specification in order to achieve this third goal is necessary [4].

Design diversity has been used in the context of N-version programming to handle design faults efficiently. Experimental results supporting the use of diversity in N-version programming are in [4]. Reference [29] observes, through an experiment, that independently generated programs may sometimes be individually extremely reliable, but in a large number of cases more than one of the programs can fail.
Thus, the claim [4] is that N-version programming must be used with care, since there might not be any reliability improvement from diversity. There are other diversity approaches for handling CMF in software. A comprehensive report of these techniques is available in [26]. A method to avoid s-identical errors caused by design faults in multiple computing systems by diversifying the input data space is in [3]; the claim is that data diversity requires an algorithm for the re-expression of input data, but does not require design diversity. Reference [7] observes that data diversity can be useful for certain applications.

Functional diversity is another technique for handling CMF in redundant systems. Functional diversity exploits the fact that some problems have multiple ways (e.g., multiple algorithms) of achieving the same result. Details of functional diversity are in [2]. The main idea is that, with functional diversity, the system may fail to achieve a goal in its standard way, but may be able to reach that goal in some different way. Functional richness of the system is one of the necessary conditions to achieve functional diversity. Functional richness is a property of a system such that it is possible to achieve the same end result in several different ways using that system.

Diversity is not confined to redundant software systems. As early as 1970, in a study on CMF in nuclear systems, diversity was observed to be a common antidote for CMF [25]. Five kinds of diversity are classified in that work: functional diversity, operational administrative diversity, design administrative diversity, equipment diversity, and physical diversity. Functional diversity provides protection against design deficiency, maintenance errors, and external sources. Operational administrative diversity requires different persons to do certain tasks, or a second person to check on the first. Equipment diversity provides different equipment (possibly of different precision) to measure the same parameter. Physical diversity relates to physical separation of the instrumentation components measuring the various key parameters.

Reference [34] uses diversity in the context of time redundancy. Instead of structural redundancy, sequential execution of various implementations of a software task on a single computer is proposed to detect software and hardware faults in a safe system. This technique has been termed systematic diversity. Techniques for designing redundant systems protected against CMF are in [36]. The techniques of RESO [41] and RERO [31] are also in this category.

Hardware design diversity has been used to design redundant hardware systems. Examples of systems using hardware design diversity include the Primary Flight Computer (PFC) system of the Boeing 777 [47], the space shuttle, the Airbus A320 [5], and many other commercial systems. For the Boeing 777, three processors with different architectures (from AMD, Intel, and Motorola) are used in the PFC system.

A. Quantifying Design Diversity

Diversity can bring benefits to a redundant system; however, these benefits are extremely difficult to quantify. Moreover [29], not all kinds of diversity are useful; there are several instances in which many of the 27 versions of the same software program in that experiment shared common faults. Thus [33], there is a need to answer questions such as: What is diversity? Are these designs more diverse than those? How diverse are these two designs?
In the literature, these questions are not answered clearly; the need for answering these questions is also expressed in [49]. Reference [33] tries to answer some of these questions. It used probability analysis to reach conclusions of the form: suppose we know that components A and B are different, but we are indifferent between a 1-out-of-2 system consisting of components A only and one consisting of components B only; then we should always build a 1-out-of-2 system made of one component A and one component B [sic]. Reference [33] claims that this observation can be generalized for a 1-out-of-n system. However, [33] assumes that, given a particular environment in which the components operate, the probabilities of failure of the components are independent, which might not be valid in general.

Reference [35] presents a metric for quantifying diversity among different designs. Assume that we are given two implementations (logic networks) $N_1$ and $N_2$ of a logic function, an input probability distribution, and faults $f_i$ and $f_j$ that occur in the first and second implementations, respectively. The diversity $d_{i,j}$ with respect to the fault pair $(f_i, f_j)$ is the conditional probability that the two implementations do not produce identical errors, given that faults $f_i$ and $f_j$ have occurred. For a given fault model, the design diversity metric $D$ between two designs is the mean value of the diversity with respect to the different fault pairs:
$$ D = \sum_{(f_i, f_j)} p(f_i, f_j)\, d_{i,j}, $$
where $p(f_i, f_j)$ is the probability of the fault pair $(f_i, f_j)$.

1) Example 1: Consider any combinational logic function with $n$ inputs and 1 output. The fault model assumes that a combinational circuit remains combinational in the presence of the fault. Consider 2 implementations $N_1$ and $N_2$ of the given combinational logic function. The joint detectability $k_{i,j}$ of a fault pair $(f_i, f_j)$ is the number of input patterns that detect both $f_i$ and $f_j$. This definition follows from the idea of detectability developed in [55]. Assume that all the input patterns are equally likely; then
$$ d_{i,j} = 1 - \frac{k_{i,j}}{2^n}. $$

The $d_{i,j}$ values generate a diversity profile for $N_1$ and $N_2$ with respect to a fault model. Consider a duplex system consisting of $N_1$ and $N_2$. In response to any input combination, $N_1$ and $N_2$ can produce 1 of 3 cases at their outputs.
1) Both of them produce correct outputs.
2) One of them produces the correct output and the other produces an incorrect output.
3) Both of them produce the same incorrect value.
For case 1, the duplex system produces correct outputs. For case 2, the system reports a mismatch so that appropriate recovery actions can be taken. For case 3, the system produces an incorrect output without reporting a mismatch; thus the integrity of the system is lost due to the presence of faults in $N_1$ and $N_2$. In the literature on fault tolerance [43], [48], this system integrity is referred to as the fault-secure property. If all fault pairs are equally probable and there are $m$ fault pairs $(f_i, f_j)$, then $D$ for $N_1$ and $N_2$ is
$$ D = \frac{1}{m} \sum_{(f_i, f_j)} d_{i,j}. $$
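For toy circuits, the metric can be computed exhaustively. The Python sketch below is our own illustration (the circuits, fault list, and variable names are assumptions, not material from [35]): it computes $D$ for two implementations of a 2-input AND under single stuck-at faults, with all input patterns and all fault pairs taken as equally likely.

```python
from itertools import product

def n1(a, b, flt=None):
    """N1: f = a AND b, built as a single AND gate (lines: a, b, z)."""
    s = lambda name, val: flt[1] if flt and flt[0] == name else val  # apply a stuck-at fault
    a, b = s('a', a), s('b', b)
    return s('z', a & b)

def n2(a, b, flt=None):
    """N2: same function via De Morgan, NOT(NOT a OR NOT b) (lines: a, b, na, nb, w, z)."""
    s = lambda name, val: flt[1] if flt and flt[0] == name else val
    a, b = s('a', a), s('b', b)
    na, nb = s('na', 1 - a), s('nb', 1 - b)
    w = s('w', na | nb)
    return s('z', 1 - w)

faults_n1 = [(line, v) for line in ('a', 'b', 'z') for v in (0, 1)]
faults_n2 = [(line, v) for line in ('a', 'b', 'na', 'nb', 'w', 'z') for v in (0, 1)]
inputs = list(product((0, 1), repeat=2))
golden = {(a, b): a & b for (a, b) in inputs}

def d(f_i, f_j):
    # "identical errors": both copies are wrong (with one output, wrong in the same way)
    both_wrong = sum(1 for (a, b) in inputs
                     if n1(a, b, f_i) != golden[(a, b)] and n2(a, b, f_j) != golden[(a, b)])
    return 1 - both_wrong / len(inputs)        # d_ij = 1 - k_ij / 2^n

pairs = [(fi, fj) for fi in faults_n1 for fj in faults_n2]
D = sum(d(fi, fj) for fi, fj in pairs) / len(pairs)   # equally probable fault pairs
print(f"D = {D:.3f}")
```

Replacing n2 with a second copy of n1 in the same loop gives $D$ for a non-diverse duplex; comparing the two values is the kind of question the metric is intended to answer.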
2) Example 2: Extend Example 1 to consider multiple-output combinational logic circuits. For a fault pair $(f_i, f_j)$ affecting $N_1$ and $N_2$, define $k_{i,j}$ as the number of input patterns in response to each of which both $N_1$ and $N_2$ produce the same erroneous output pattern. Use the same formulas as in Example 1. For example, consider a combinational logic function with 2 inputs and 2 outputs (Table I). Let $f_1$ and $f_2$ affect $N_1$ and $N_2$, respectively.

TABLE I. Behavior of Faulty Multiple-Output Circuits.

Table I shows the responses of $N_1$ and $N_2$ in the presence of the faults; the faulty output bits are highlighted in columns 3 and 4. To calculate $k_{1,2}$, consider only the input patterns 10 and 11. This illustration of $D$ can be extended to sequential circuits and software programs. For small or medium-sized systems, the exact value of $D$ can be calculated manually or using computer programs. For large systems, the value can be estimated with simulation techniques.

3) Reliability Analysis Using the Design Diversity Metric: Reliability analysis for redundant systems has been performed in [35] using the design diversity metric. $D$ is useful because it is closely related to the causal structure of the CMF; as mentioned in Section III, this causal structure is missing in all the previous work on reliability analysis of CMF. Theoretical analysis using $D$ and simulation results lead to two conclusions about diversity [35].
1) For s-independent multiple-module failures, mere use of different implementations does not always guarantee higher reliability compared to redundant systems with s-identical implementations. It is important to analyze the reliability of redundant systems using $D$.
2) For CMF and design faults, there is an important gain in using different implementations. However, our analysis shows that the gain diminishes as the mission time increases. Our simulation results demonstrate the usefulness of diversity for enhancing the self-testing properties of redundant systems.

B. Designing for Diversity: Synthesis Problems

This section discusses 2 categories of techniques that can be used to achieve sufficient diversity in a given hardware or software system:
A) techniques that do not consider any fault model;
B) techniques that consider an underlying fault model.

The concept of N-version programming was proposed in [4] to achieve diversity in software systems. This technique can be used at various levels of abstraction. For example, entirely different algorithms can be used for performing a particular computation. On the other hand, different implementations of the same algorithm (possibly by different s-independent programmers) can achieve diversity. However, it is not easy to quantify the diversity that is really obtained with these techniques [29]. This paper has already discussed the commercial use of hardware design diversity in the Boeing 777 [47], the space shuttle, the Airbus A320 [5], etc. All these are examples of implementing diversity without any underlying fault model. We hope that, with different implementations, the errors in the different copies will be different.

Although the majority of the design diversity techniques in computing systems are focused on techniques that do not rely on any fault model, some work has been done on diversity with some underlying fault model in mind. RESO (Recomputation using Shifted Operands) [41] and RERO [31] are two such error detection techniques that are targeted toward ALU using the concept of time redundancy. In RESO, during the initial computation, the operands are passed to the inputs of the ALU and the result is stored in a register. During the recomputation step, the operands are shifted left (by $k$ bits) and then applied to the inputs of the ALU under consideration. The computed result is right-shifted and then compared with the result obtained from the initial step (stored in a register). If these two results mismatch, an error signal is turned on.
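A minimal software sketch of the RESO comparison step for unsigned addition follows; the width, shift amount, and toy fault model are assumptions, and the sketch illustrates only the idea, not the hardware scheme or the fault-coverage argument of [41].

```python
WIDTH = 16                      # assumed ALU width
K = 2                           # assumed shift amount
MASK = (1 << WIDTH) - 1

def alu_add(x, y, broken_bit=None):
    """Stand-in for the ALU; `broken_bit` forces one sum bit to 0 (a toy bit-slice fault)."""
    s = (x + y) & MASK
    if broken_bit is not None:
        s &= ~(1 << broken_bit)
    return s

def reso_add(x, y, broken_bit=None):
    first = alu_add(x, y, broken_bit)                                    # step 1: normal computation
    second = alu_add((x << K) & MASK, (y << K) & MASK, broken_bit) >> K  # step 2: shifted recomputation
    return first, first != second                                        # (result, error signal)

print(reso_add(100, 23))                  # (123, False): fault-free, the two results agree
print(reso_add(100, 23, broken_bit=3))    # (..., True): the faulty bit position disagrees after the shift
```

RERO [31] follows the same compare-after-recomputation pattern, with rotation in place of shifting so that operand bits are not shifted out.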
Reference [41] shows that, for most practical ALU implementations, RESO detects all errors caused by faults in a bit slice or a specific subcircuit of the bit slice. RESO has been extended to RERO (Recomputation using Rotated Operands) [31].

For duplex systems using hardware redundancy, Tohma proposed using implementations of logic functions in true and complemented forms [56]. The use of a particular circuit and its dual was proposed in [49] to achieve diversity in order to handle CMF. The basic idea is that, with different implementations, failures that affect the two circuits in the same way will probably cause different error effects. Reference [36] introduces a common-mode fault model involving the register bits at the inputs of the individual copies, and proposes synthesis techniques for designing redundant systems that are protected against the modeled CMF.

Conventional (high-level, logic, layout) synthesis techniques can be adapted to generate multiple designs such that the design diversity metric of Section V-A is maximized. Adapting these synthesis techniques for generating diverse designs leads to interesting (and important) open problems for researchers in this field. For example, in [57], a technique for synthesizing diverse combinational logic circuits has been described. This technique maximizes the data integrity of the resulting diverse duplex system against multiple failures and CMF while minimizing the area overhead.

REFERENCES

[1] M. Abramovici, M. A. Breuer, and A. D. Friedman, Digital Systems Testing and Testable Design, 1990.
[2] R. L. Abbott, Resourceful systems for fault tolerance, reliability and safety, ACM Computing Surveys, vol. 22, no. 1, pp. 35-68, 1990.
[3] P. E. Ammann and J. C. Knight, Data diversity: An approach to software fault tolerance, in Proc. Int. Symp. Fault-Tolerant Computing, 1987, pp. 122-126.
[4] A. Avizienis and J. P. J. Kelly, Fault tolerance by design diversity: Concepts and experiments, IEEE Computer, pp. 67-80, Aug. 1984.
[5] D. Briere and P. Traverse, Airbus A320/A330/A340 electrical flight controls: A family of fault-tolerant systems, in Proc. Int. Symp. Fault-Tolerant Computing, 1993, pp. 616-623.
[6] J. T.-Y. Chang and E. J. McCluskey, SHOrt Voltage Elevation (SHOVE) test, in Proc. Int. Test Conf., 1996, pp. 45-49.
[7] J. Christmansson, Z. Kalbarczyk, and J. Torin, Dependable flight control system by data diversity and self-checking components, Microprocessor and Microprogramming, vol. 40, no. 2-3, pp. 207-222, 1994.
[8] J. A. Clark and D. K. Pradhan, Fault injection: A method for validating computer system dependability, IEEE Computer, vol. 28, no. 6, pp. 47-56, Jun. 1995.
[9] M. L. Cortes, Temporary failures in digital circuits: Experimental results and fault modeling, Ph.D. dissertation, Center for Reliable Computing, Stanford Univ., 1987.
[10] C. Davis, Common-mode failure in information systems, SciTech J., vol. 6, no. 5, pp. 13-15, Jul.-Aug. 1996.
[11] B. S. Dhillon and C. L. Proctor, Common-mode failure analysis of reliability networks, in Proc. Ann. Reliability and Maintainability Symp., 1977, pp. 404-408.
[12] B. S. Dhillon and O. C. Anude, Common-cause failure analysis of a redundant system with repairable units, Int. J. Syst. Sci., vol. 25, no. 3, pp. 527-540, Mar. 1994.
[13] B. S. Dhillon and O. C. Anude, Common-cause failure analysis of a k-out-of-n: G system with repairable units, Microelectronics and Reliability, vol. 34, no. 3, pp. 429-442, Mar. 1994.
[14] B. S. Dhillon and O. C. Anude, Common-cause failure analysis of a k-out-of-n: G system with nonrepairable units, Int. J. Syst. Sci., vol. 26, no. 10, pp. 2029-2042, Oct.
[15] Easterling, Probabilistic analysis of common-mode failures, in Proc. ANS Conf. Probabilistic Analysis of Nuclear Safety, 1978.
[16] K. N. Fleming, A. Mosleh, and A. P. Kelly, On the analysis of dependent failures in risk assessment and reliability evaluation, Nuclear Safety, vol. 24, pp. 637-657, 1983.
[17] K. K. Goswami et al., DEPEND: A simulation-based environment for system level dependability analysis, IEEE Trans. Computers, vol. 46, no. 1, pp. 60-74, Jan. 1997.
[18] R. K. Gulati and C. F. Hawkins, Iddq Testing of VLSI Circuits. Kluwer Academic Publishers, 1993.
[19] U. Gunneflo, J. Karlsson, and J. Torin, Evaluation of error detection schemes using fault injection by heavy-ion radiation, in Proc. Int. Symp. Fault-Tolerant Computing, 1989, pp. 340-347.
[20] S. G. Han and W. H. Yoon, The trinomial failure rate model for treating common-mode failures, Reliability Engineering and System Safety, vol. 25, no. 2, pp. 131-146, 1989.
[21] H. Hao and E. J. McCluskey, Very-low-voltage testing for weak CMOS logic ICs, in Proc. Int. Test Conf., 1993, pp. 275-284.
[22] K. Harada and T. Hidaka, Probability analysis of a 2-out-of-n: F system with common cause failure, Microelectronics and Reliability, vol. 34, no. 2, pp. 289-296, Feb. 1994.
[23] E. R. Hnatek, Integrated Circuit Quality and Reliability, 1995.
[24] R. K. Iyer and D. Tang, Experimental analysis of computer system dependability, Center for Reliable and High-Performance Computing, Univ. Illinois at Urbana-Champaign, Tech. Rep. CRHC-93-15, 1993.
[25] I. M. Jacobs, The common mode failure study discipline, IEEE Trans. Nucl. Sci., vol. 17, no. 1, pp. 594-598, Feb. 1970.
[26] Z. Kalbarczyk and J. Christmansson, Technical approaches for reducing the probability of common-cause/common-mode failures: A survey, Lab. for Dependable Computing, Dept. of Computer Engineering, Chalmers Univ. of Technology, Sweden, Tech. Rep. 237, May 1995.
[27] J. Karlsson et al., Using heavy-ion radiation to validate fault-handling mechanisms, IEEE Micro, vol. 14, no. 1, pp. 8-23, Feb. 1994.
[28] H. Kim and K. G. Shin, Modeling of externally-induced/common-cause faults in fault-tolerant systems, in Proc. AIAA/IEEE Digital Avionics Systems Conf., 1994, pp. 402-407.
[29] J. C. Knight and N. G. Leveson, A large scale experiment in N-version programming, in Proc. Int. Symp. Fault-Tolerant Computing, 1985, pp. 135-139.
[30] J. H. Lala and R. E. Harper, Architectural principles for safety-critical real-time applications, Proc. IEEE, vol. 82, pp. 25-40, 1994.
[31] J. Li and E. E. Swartzlander, Concurrent error detection in ALUs by recomputing with rotated operands, in Proc. IEEE Int. Workshop on Defect and Fault Tolerance in VLSI Systems, 1992, pp. 109-116.
[32] P. Liden et al., On latching probability of particle-induced transients in combinational networks, in Proc. FTCS, 1994, pp. 340-349.
[33] B. Littlewood, The impact of diversity upon common-mode failures, Reliability Engineering and System Safety, vol. 51, no. 1, pp. 101-113, 1996.
[34] T. Lovric, Detecting hardware-faults with systematic and design diversity: Experimental results, Computer Systems Science and Engineering, vol. 11, no. 2, pp. 83-92, 1996.
[35] S. Mitra, N. R. Saxena, and E. J.
McCluskey, A design diversity metric and reliability analysis for redundant systems, in Proc. Int. Test Conf., 1999, pp. 662-671.
[36] S. Mitra and E. J. McCluskey, Design of redundant systems protected against common-mode failures, Center for Reliable Computing, Stanford Univ., http://crc.stanford.edu, 2000.
[37] B. M. E. Moret et al., Boolean difference techniques for time-sequence and common-cause analysis of fault-trees, IEEE Trans. Reliability, vol. R-33, no. 5, pp. 399-405, Dec. 1984.
[38] A. Mosleh, Common cause failures: An analysis methodology and examples, Reliability Engineering and System Safety, vol. 34, no. 3, pp. 249-292, 1991.
[39] W. Needham, Designer's Guide to Testable ASIC Devices, 1991.
[40] G. W. Parry, Common cause failure analysis: A critique and some suggestions, Reliability Engineering and System Safety, vol. 34, no. 3, pp. 309-326, 1991.
[41] J. H. Patel and L. Y. Fung, Concurrent error detection in ALUs by recomputing with shifted operands, IEEE Trans. Computers, vol. C-31, no. 7, pp. 589-595, Jul. 1982.
[42] D. A. Patterson, P. Chen, G. Gibson, and R. H. Katz, Introduction to redundant arrays of inexpensive disks, in Proc. COMPCON, 1989, pp. 112-117.
[43] D. K. Pradhan, Fault-Tolerant Computer System Design. Prentice Hall, 1996.
[44] B. Putney, A common-cause evaluation methodology for large fault trees, Trans. American Nuclear Soc., vol. 33, pp. 574-575, 1979.
[45] Quickturn Design Systems (A Cadence Company), http://www.quickturn.com.
[46] R. Reed et al., Heavy ion and proton-induced single event multiple upset, IEEE Trans. Nucl. Sci., vol. 44, no. 6, pp. 2224-2229, Jul. 1997.
[47] R. Riter, Modeling and testing a critical fault-tolerant multi-process system, in Proc. Int. Symp. Fault-Tolerant Computing, 1995, pp. 516-521.
[48] D. P. Siewiorek and R. S. Swarz, Reliable Computer Systems: Design and Evaluation. Digital Press, 1992.
[49] Y. Tamir and C. H. Sequin, Reducing common mode failures in duplicate modules, in Proc. IEEE Int. Conf. Computer Design, 1984, pp. 302-307.
[50] K. S. Trivedi, Probability and Statistics with Reliability, Queuing, and Computer Science Applications. Prentice Hall, 1982.
[51] J. K. Vaurio, The probabilistic modeling of external common cause failure shocks in redundant systems, Reliability Engineering and System Safety, vol. 50, no. 1, pp. 97-107, 1995.
[52] J. K. Vaurio, An implicit method for incorporating common-cause failures in system analysis, IEEE Trans. Reliability, vol. 47, no. 2, pp. 173-180, Jun. 1998.
[53] I. A. Watson and G. T. Edwards, Common-mode failures in redundancy systems, Nuclear Technology, vol. 46, pp. 183-191, Dec. 1979.
[54] R. B. Worrell and G. R. Burdick, Qualitative analysis in reliability and safety studies, IEEE Trans. Reliability, vol. R-25, no. 3, pp. 164-170, Aug. 1976.
[55] E. J. McCluskey, S. Makar, S. Mourad, and K. D. Wagner, Probability models for pseudo-random test sequences, IEEE Trans. Computers, vol. 37, no. 2, pp. 160-174, Feb. 1988.
[56] Y. Tohma and S. Aoyagi, Failure-tolerant sequential machines with past information, IEEE Trans. Computers, vol. C-20, no. 4, pp. 392-396, Apr. 1971.
[57] S. Mitra and E. J. McCluskey, Combinational logic synthesis for diversity in duplex systems, in Proc. Int. Test Conf., 2000, pp. 179-188.
Subhasish Mitra is an Assistant Director at the Stanford Center for Reliable Computing (CRC). He received the B.E. (1994) in computer science and engineering from Jadavpur University, Calcutta, India, the M.Tech. (1996) in computer science and engineering from the Indian Institute of Technology, Kharagpur, and the Ph.D. (2000) from Stanford University, California. Prof. E. J. McCluskey was his Ph.D. thesis adviser. Dr. Mitra is a Research Associate in the DARPA-sponsored ROAR Project at Stanford CRC, and provides part-time consulting in various areas of VLSI design and test. His research interests include digital testing, logic synthesis, and fault-tolerant computing. Dr. Mitra received gold medals for being the top student in the School of Engineering at the undergraduate and M.Tech. levels.

Nirmal R. Saxena is an Associate Director at Stanford CRC. His research interests include computer architecture, fault-tolerant computing, combinatorial mathematics, probability theory, and VLSI design/test. He received the B.E. (1982) in electronics and communication engineering from Osmania University, India; the M.S. (1984) in electrical engineering from the University of Iowa; and the Ph.D. (1991) in electrical engineering from Stanford University. He is a Senior Member of the IEEE.

Edward J. McCluskey received the A.B. (summa cum laude, 1953) in mathematics and physics from Bowdoin College, and the B.S. (1953), M.S. (1953), and Sc.D. (1956) in electrical engineering from MIT. The degree of Doctor Honoris Causa (1994) was awarded by the Institut National Polytechnique de Grenoble. He worked on electronic switching systems at the Bell Telephone Laboratories from 1955 to 1959. In 1959, he moved to Princeton University, where he was Professor of Electrical Engineering and Director of the University Computer Center. In 1966, he joined Stanford University, where he is Professor of Electrical Engineering and Computer Science, and Director of the Center for Reliable Computing. He founded the Stanford Digital Systems Laboratory (now the Computer Systems Laboratory) in 1969 and the Stanford Computer Engineering Program (now the Computer Science M.S. Degree Program) in 1970. The Stanford Computer Forum (an Industrial Affiliates Program) was started by Dr. McCluskey and two colleagues in 1970, and he was its Director until 1978.
McCluskey developed the first algorithm for designing combinational circuits, the Quine-McCluskey logic minimization procedure, as a doctoral student at MIT. At Bell Labs and Princeton, he developed the modern theory of transients (hazards) in logic networks and formulated the concept of operating modes of sequential circuits. His Stanford research focuses on logic testing, synthesis, design for testability, and fault-tolerant computing. Prof. McCluskey and his students at CRC worked out many key ideas for fault equivalence, probabilistic modeling of logic networks, pseudo-exhaustive testing, and watchdog processors. He collaborated with Signetics researchers in developing one of the first practical multivalued logic implementations and then worked out a design technique for such circuitry.
Dr. McCluskey was the first President of the IEEE Computer Society. He received the 1996 IEEE Emanuel R. Piore Award. He is a Fellow of the IEEE, AAAS, and ACM, and a member of the NAE. He has published several books (including two widely used texts) and book chapters, as well as hundreds of papers.
His most recent book is Logic Design Principles with Emphasis on Testable Semicustom Circuits (Prentice-Hall, 1986). His other recent honors include election to the National Academy of Engineering (1998) and selection as an IEEE Computer Society Golden Core Member. In 1984 he received the IEEE Centennial Medal and the IEEE Computer Society Technical Achievement Award in Testing. In 1990 he received the EURO ASIC 90 Prize for Fundamental Outstanding Contribution to Logic Synthesis. The IEEE Computer Society honored him with the 1991 Taylor L. Booth Education Award.