ABSTRACT

Title of Thesis: SYSTEM LEVEL APPROACH FOR LIFE CONSUMPTION MONITORING OF ELECTRONICS

Sathyanarayan Ganesan, Master of Science, 2004

Thesis Directed By: Professor Michael Pecht, Department of Mechanical Engineering

Life consumption monitoring involves monitoring the operating and environmental conditions of a product to predict its remaining life. This thesis presents a life consumption monitoring process that is applicable at the system level. Failure Modes, Mechanisms and Effects Analysis (FMMEA) is introduced as a new step in the life consumption monitoring process. It systematically identifies potential failure mechanisms and models for all potential failure modes, and prioritizes the failure mechanisms to identify the high priority mechanisms. The high priority mechanisms help determine the right suite of product parameters to monitor for determining the damage and life consumed. A case study describing the FMMEA process for a simple electronic circuit board assembly is presented. A new methodology for remaining life prediction is introduced into the life consumption monitoring process and is validated using data from a previous life consumption monitoring case study. A discussion of the uncertainty and accuracy of prediction is also presented.

SYSTEM LEVEL APPROACH FOR LIFE CONSUMPTION MONITORING OF ELECTRONICS

By Sathyanarayan Ganesan

Thesis submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Master of Science, 2004

Advisory Committee:
Professor Michael Pecht, Chair
Associate Professor Peter Sandborn
Associate Professor Bongtae Han

© Copyright by Sathyanarayan Ganesan 2004

ACKNOWLEDGEMENTS

First of all, I am grateful to Dr. Michael Pecht and Dr. Diganta Das for giving me the opportunity to work in CALCE as a research assistant and undertake this work. They have been advisors to me in more ways than just academically.
Without their guidance this work wouldn't have been possible. Next, I would like to thank my thesis committee for appreciating and acknowledging my graduate research work. My thanks also extend to Dr. Michael Osterman, Dr. Peter Rodgers, Dr. Valerie Everoy, Dr. Michael Azarian, Dr. Sanka Ganesan and Dan Donahoe for their inputs and suggestions to the thesis. I am greatly thankful to all my colleagues at CALCE (everyone@calce.umd.edu) for their help and support. My special thanks to Dr. Miky Lee, Keith Rogers, Raju Shah, Satchidananda Mishra, Sanjay Tiku, Paul Casey, Subramaniam Rajagopal, Yuki Fukuda, Dr. Ji Wu, Yu-Chul Hwang, Nikhil Vichare, Rajeev Mishra, Anupam Choubey, Sony Mathew, Kaushik Ghosh, Manash Dash, Karambu Nathan, Raj Bahadur, Niranjan Vijayaragavan, Vidyasagar Shetty, Anshul Shrivastava, Sudhir Kumar and Vimal Mayank for their good company. Last but not least, my regards and gratitude to my parents for their constant support and motivation.

TABLE OF CONTENTS

ABSTRACT
CHAPTER 1: INTRODUCTION
CHAPTER 2: IMPROVED LIFE CONSUMPTION MONITORING APPROACH
CHAPTER 3: FAILURE MODES, MECHANISMS AND EFFECTS ANALYSIS METHODOLOGY
3.1 FAILURE MODES, MECHANISMS AND EFFECTS ANALYSIS METHODOLOGY
3.1.1 System definition, elements and functions
3.1.2 Potential failure modes
3.1.3 Potential failure causes
3.1.4 Potential failure mechanisms
3.1.5 Failure models
3.1.6 Life cycle environment and operating conditions
3.1.7 Failure mechanism prioritization
3.1.8 Documentation
3.2 CASE STUDY
3.3 BENEFITS
CHAPTER 4: PROGNOSTICS AND REMAINING LIFE ESTIMATION
4.1.1 Leap-Frog Technique
4.1.2 Accuracy of Remaining Life Prediction
CHAPTER 5: CONCLUSIONS
CHAPTER 6: REFERENCES

LIST OF FIGURES

Figure 1: Improved life consumption monitoring methodology
Figure 2: FMEA worksheet [13]
Figure 3: FMMEA methodology
Figure 4: Failure mechanism prioritization
Figure 5: Elements in the circuit card system
Figure 6: Choosing the right data window
Figure 7: LEAP-Frog Algorithm
Figure 8: Comparison of remaining life estimation models - Gradual
Figure 9: Comparison of remaining life estimation models - Gradual & Sudden
Figure 10: Accuracy of prediction
Figure 11: Error and uncertainty in predictions

LIST OF TABLES

Table 1: Occurrence ratings
Table 2: Severity ratings
Table 3: Risk Matrix
Table 4: FMMEA worksheet for the Case Study

Chapter 1: INTRODUCTION

Health monitoring is a method of assessing the degradation of a product's health (reliability) in its life cycle environment by continuous or periodic monitoring and interpretation of the parameters indicative of its health. Based on the product's health, determined from the monitored actual life cycle conditions, procedures can be developed to maintain the product [28] [32]. Health monitoring therefore permits new products to be concurrently designed for a life cycle environment known through monitoring.

Product health monitoring can be implemented through the use of various techniques to sense and interpret the parameters indicative of:
1. Performance degradation (e.g. deviation of operating parameters from their expected values);
2. Physical or electrical degradation (e.g.
cracks, corrosion, delamination, increase in electrical resistance or threshold voltage);
3. Changes in life cycle environment (e.g. usage duration and frequency, ambient temperature, vibration, shock, humidity, etc.).

Health monitoring systems are typically categorized as diagnostic, prognostic, or life consumption monitoring (LCM) systems. Diagnostic systems monitor the current operating state of health to identify potential causes of failure in order to restore the system. These systems are widely used across different industries for fault identification purposes. An example of a diagnostic system is the use of piezoelectric sensors, which detect and analyze the ultrasonic acoustic signals traveling through machinery to report fault or wearout conditions [29].

Prognostic systems monitor the faults or precursors to failure, and predict the time or number of operational cycles to failure induced by a monitored fault. Examples of prognostic systems include Self-Monitoring Analysis and Reporting Technology (SMART) employed in computer hard drives [30].

LCM is a method of monitoring parameters indicative of a system's life cycle health and converting the measured data into life consumed [32]. The LCM process involves continuous or periodic measuring, sensing, recording, and interpretation of product parameters to quantify the amount of product degradation. LCM systems have been introduced in the automotive industry for automotive engine oil monitoring [31], the degradation of which depends upon time, temperature, and contamination related to engine usage.
Such LCM systems incorporate physics based mechanisms and predictive models which estimate the remaining life of oil based on the monitored engine usage. The model algorithms are programmed into the engine control modules to inform the driver of the oil life status.

Failures in electronic products are often attributable to various combinations, intensities, and durations of environmental loads, such as temperature, humidity, vibration, and radiation. For many of the failure mechanisms in electronic products, there are models that relate environmental loads to the time to failure of the product. Thus, by monitoring the environment of the product over its life cycle, it may be possible to determine the amount of damage induced by various loads and predict when the product might fail [32].

Based on the knowledge of the product failure mechanisms, appropriate life consumption monitoring systems and prognostics strategies can be developed. This thesis discusses the application of FMMEA in support of such strategies. The FMMEA methodology is presented because it has the potential to contribute to effective life consumption monitoring of electronics.

Ideally, products should be designed such that the threshold of damage required to cause failure is not reached within the usage life of the product. To achieve that, knowledge of the usage environment, the failure modes, the failure mechanisms and their impact on the design is necessary. To evaluate the product reliability and to design for reliability, all relevant failure mechanisms must be considered. Determining the set of relevant failure mechanisms can become a large undertaking for most electronic systems. FMMEA can be used to achieve this task in a systematic manner.

Chapter 2: IMPROVED LIFE CONSUMPTION MONITORING APPROACH

Ramakrishnan [26] proposed a physics-of-failure based methodology for determining the damage or life consumed in a product.
It showed how to make the environmental data compatible with physics-of-failure models to estimate the amount of accumulated damage. Mishra [27] extended the approach to include estimation of the remaining life of a product from the collected environmental parameters.

System level life consumption monitoring requires systematically identifying all the parameters that can drive failure in a system and then monitoring only the select few that dominate. FMMEA is introduced as a new step in the LCM process that systematically identifies all failure mechanisms and models for all potential failure modes, and prioritizes the failure mechanisms to identify the high priority mechanisms. Ideally, all failure mechanisms and their interactions must be considered for product design and analysis. In the life cycle of a product, several failure mechanisms may be activated by different environmental and operational parameters acting at various stress levels, but in general only a few operational and environmental parameters and failure mechanisms are responsible for the majority of the failures. Focusing on high priority mechanisms makes effective use of resources. They are the select failure mechanisms that determine the operational stresses and the environmental and operational parameters that must be accounted for in the design or be controlled. This enables selection of the right suite of product parameters to monitor for determining the damage and life consumed.

The monitored parameters are then simplified to reduce the memory requirement and to be compatible with the failure models associated with the critical failure mechanisms. The simplified data is used for performing stress and damage accumulation analysis, and the accumulated damage is subsequently used for predicting the remaining life.
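The step from accumulated damage to remaining life can be illustrated with a minimal sketch. It assumes linear damage accumulation (Miner's rule) and a linear extrapolation of the damage rate observed so far; neither is specified in this chapter, and the function names and numbers are hypothetical. It is not the prediction algorithm proposed in this thesis.

```python
def accumulated_damage(load_blocks):
    """Sum damage fractions n_i / N_i over load blocks (Miner's rule, assumed here).

    load_blocks: iterable of (cycles_applied, cycles_to_failure) pairs, where
    cycles_to_failure would come from a failure model for that load level.
    """
    return sum(applied / to_failure for applied, to_failure in load_blocks)


def remaining_life(elapsed_time, damage):
    """Extrapolate remaining life assuming damage keeps accruing at the average rate so far."""
    if damage <= 0:
        raise ValueError("no damage accumulated yet; remaining life cannot be extrapolated")
    if damage >= 1.0:
        return 0.0  # damage fraction of 1 is taken as failure
    return elapsed_time * (1.0 - damage) / damage


# Hypothetical monitored history: two load levels seen over 10 days of use.
damage = accumulated_damage([(100, 1000), (50, 500)])
print(f"damage fraction: {damage:.2f}")
print(f"estimated remaining life: {remaining_life(10.0, damage):.1f} days")
```

Because the estimate divides by the damage accumulated so far, it is only meaningful once some damage has been recorded, which is why the sketch raises an error for zero damage.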
The traditional remaining life estimation algorithm used in earlier approaches to life consumption monitoring was overly simplistic and took into account only the previous data point during each iteration. A new prognostic algorithm for remaining life prediction that incorporates all data points has been demonstrated, and it has been compared with other remaining life prediction models using data from a previous case study.

The new LCM methodology has five steps to estimate the remaining life of an electronic product, as shown in Figure 1. These steps are FMMEA, monitoring of product parameters, data simplification and processing, stress and damage accumulation analysis, and remaining life estimation.

Figure 1: Improved life consumption monitoring methodology
(Flowchart boxes: Conduct FMMEA; Monitor product parameters; Conduct data simplification and processing; Perform stress and damage accumulation analysis; Estimate the remaining life of the product. Decision: Remedial action? Yes: schedule maintenance or replace product. No: continue monitoring.)

Chapter 3: FAILURE MODES, MECHANISMS AND EFFECTS ANALYSIS METHODOLOGY

The competitive marketplace and the need to reduce the life cycle cost of products are making product developers and manufacturers look for economical ways to improve the product development process. Increased demands on companies for high quality, reliable products and the increasing capabilities and functionality of many products are making it difficult for manufacturers to maintain quality and reliability. Industry has been interested in a systematic approach that gives a better understanding of potential failures and how they might affect product performance. Some organizations are either using or requiring the use of Failure Mode and Effects Analysis (FMEA) towards achieving that goal. Failure is the loss of the ability of a product to perform its required function [1].
FMEA is a systematic procedure to evaluate potential failures, identify the effects of failures, and determine actions which could eliminate or reduce the chance of the potential failure occurring [10].

FMEA was developed as a formal methodology in the 1950s at Grumman Aircraft Corporation, where it was used to analyze the safety of flight control systems for naval aircraft. From the 1970s through the 1990s, various military and professional society standards and procedures were written to define the FMEA methodology [7] [8] [13] to meet the needs of various industry sectors. In 1971, the Electronic Industries Association (EIA) G-41 committee on reliability published "Failure Mode and Effects Analysis". In 1974, the US Department of Defense published Mil-Std 1629 "Procedures for Performing a Failure Mode, Effects and Criticality Analysis", which through several revisions became the basic approach for analyzing systems. In 1985, the International Electrotechnical Commission (IEC) introduced IEC 812 "Analysis Techniques for System Reliability - Procedure for Failure Modes and Effects Analysis". In the late 1980s the automotive industry adopted the FMEA practice. In 1993, the Supplier Quality Requirements Task Force, comprising representatives from Chrysler, Ford and GM, introduced FMEA into the quality manuals through the QS 9000 process. In 1994, the Society of Automotive Engineers (SAE) published the SAE J-1739 "Potential Failure Modes and Effects Analysis in Design and Potential Failure Modes and Effects Analysis in Manufacturing and Assembly Processes" reference manual, which provided general guidelines for preparing an FMEA. In 1999, Daimler Chrysler, Ford and GM, as part of the International Automotive Task Force, agreed to recognize the new international standard "ISO/TS 16949" that included FMEA and would eventually replace QS 9000 in 2006.
FMEAs are used across many industries and are often referred to by types such as System FMEA, Design FMEA (DFMEA), Process FMEA (PFMEA), Machinery FMEA (MFMEA), Functional FMEA, Interface FMEA and Detailed FMEA. Although the purpose, terminology and details can vary according to type and industry, the principal objectives of FMEAs are to anticipate the most important problems early in the development process and either prevent the problems or minimize their consequences. FMEA can be applied at any point in the product life cycle, from design to end-of-life, and provides a formal and systematic approach for product and process development.

FMEA was initially limited to the analysis of the effects of failure modes for safety analysis. Failure Modes, Effects and Criticality Analysis (FMECA) was considered an extension of FMEA that included assessing the probability of occurrence and criticality of potential failure modes. Today, the distinctions between the two have become less well defined and the terms FMEA and FMECA are used interchangeably [6] [7]. FMEA is also one of the six sigma tools [9] and is utilized by some six sigma organizations in some form or other.

The FMEA methodology is based on a hierarchical approach to determine how possible failure modes affect the system [7]. The basic procedure is to:
1. Identify elements or functions in the system
2. Identify all element or function failure modes
3. Determine the effect(s) of each failure mode and its severity
4. Determine the cause(s) of each failure mode and its probability of occurrence
5. Identify the current controls in place to prevent or detect the potential failure modes
6. Assess risk, prioritize failures and assign corrective actions to eliminate or mitigate the risk
7. Document the process

FMEA involves inputs from a cross-functional team having the ability to analyze the whole product life cycle [15].
To achieve the greatest value, FMEA should be conducted before a failure mode has been unknowingly built into the product, when design changes are easier and less expensive [4]. A typical design FMEA worksheet is shown in Figure 2.

For risk assessment, an FMEA uses occurrence and detection probabilities in conjunction with severity criteria to develop a risk priority number (RPN). The RPN is the product of the severity, occurrence and detection ratings. After the RPNs are evaluated, they are prioritized and corrective actions are taken to mitigate the risk. Once the corrective actions are implemented, the severity, occurrence and detection values are reassessed, and a new RPN is calculated. This process continues until the risk level is acceptable. Thus, the FMEA is reviewed and updated periodically.

Figure 2: FMEA worksheet [13]
(Worksheet form titled "Potential Failure Mode and Effects Analysis (Design FMEA)". Header fields: System, Subsystem, Component, Design Lead, Core Team, FMEA Number, Prepared By, FMEA Date, Key Date, Revision Date, Page ... of .... Columns: Item / Function; Potential Failure Mode(s); Potential Effect(s) of Failure; Sev; Potential Cause(s) of Failure; Prob; Current Design Controls; Det; RPN; Recommended Action(s); Responsibility & Target Completion Date; Action Results: Actions Taken, New Sev, New Occ, New Det, New RPN.)

Neither FMEA nor FMECA identifies the failure mechanisms and models in the analysis and reporting process. Failure mechanisms are the processes by which specific combinations of physical, electrical, chemical and mechanical stresses induce failure [1]. In order to understand and prevent failures, the mechanisms must be identified with respect to the predominant stresses (mechanical, thermal, electrical, chemical, radiation) which cause them. Knowledge of the causes and consequences of these mechanisms helps in several design and development steps.
These include virtual qualification, accelerated testing, root cause analysis and life consumption monitoring, which are essential for developing reliable products in an economical manner. Besides these benefits, understanding the failure mechanisms also helps to identify the acceptable level of "defects" and variability in manufacturing and material parameters and to specify appropriate ratings for the products.

Because it does not use failure mechanism information, FMEA cannot provide meaningful input to procedures such as virtual qualification, root cause analysis and accelerated test programs. In FMEA, all failure modes are considered individually and the combined effect of the failure modes is not taken into consideration. FMEA is also based on precipitation and detection of failure, and it is not designed to be applied in cases that involve continuous monitoring of performance degradation over time, such as life consumption monitoring and prognostics. Environmental and operating conditions are not used at a quantitative level in FMEA. At best they are used to eliminate certain failure modes from consideration.

For failure prioritization in FMEA, a qualitative scale is transformed into a quantitative scale for evaluating the RPN. Potential failure modes having higher RPNs are assumed to represent a higher risk than those having lower numbers. In the transformation from a qualitative to a quantitative scale, all three indices (severity, occurrence and detection) have the same metric and are treated as equally important [9]. Because the RPN is a product, the same small change in one factor can have very different effects on the RPN depending on the values of the other two factors. Hence, some implementations of FMEA provide a false sense of granularity between the different failure modes when none exists.
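The RPN arithmetic and its uneven sensitivity can be sketched in a few lines. This is an illustration only: the function name is hypothetical and the 1 to 10 rating range is an assumption about the rating scale in use.

```python
def rpn(severity, occurrence, detection):
    """Risk priority number: the product of the three FMEA ratings.

    Ratings are assumed to lie on a 1-10 integer scale (an assumption for this sketch).
    """
    for rating in (severity, occurrence, detection):
        if not 1 <= rating <= 10:
            raise ValueError(f"rating {rating} is outside the assumed 1-10 scale")
    return severity * occurrence * detection


# The same 1 point change in severity, with the other two ratings at their extremes.
step_at_high_ratings = rpn(6, 10, 10) - rpn(5, 10, 10)
step_at_low_ratings = rpn(6, 1, 1) - rpn(5, 1, 1)
print(step_at_high_ratings, step_at_low_ratings)
```

The two differences show that an RPN gap between two failure modes says little by itself about how different their risks are.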
For example, if detection and occurrence both have a rating of 10, a 1 point difference in the severity ranking results in a 100 point difference in the RPN; at the other extreme, if detection and occurrence both have a rating of 1, the same 1 point difference gives only a 1 point difference in the RPN [11]. There is a need to prioritize failures using the predominant stresses from the environmental and operating conditions from which they arise, and to evaluate them quantitatively to make the prioritization process more scientific. Use of failure mechanisms and models is a step in that direction. Determining the failure mechanisms can become a large undertaking for most products and systems. FMMEA can be used to achieve this task in a systematic manner.

3.1 Failure modes, mechanisms and effects analysis methodology

FMMEA is a systematic approach to identify failure mechanisms and models for all potential failure modes, and to prioritize them to identify high priority failure mechanisms. High priority failure mechanisms determine the operational stresses and the environmental and operational parameters that need to be accounted for in the design or be controlled.

FMMEA is based on understanding the relationships between product requirements and the physical characteristics of the product (and their variation in the production process), the interactions of product materials with loads (stresses at application conditions) and their influence on product failure susceptibility with respect to the use conditions. This involves finding the failure mechanisms and the reliability models to quantitatively evaluate failure susceptibility.

The FMMEA process merges the systematic nature of the FMEA template with the "design for reliability" philosophy and knowledge. In addition to the information gathered and used for FMEA, FMMEA uses application conditions and the duration of the intended application with knowledge of active stresses and potential failure mechanisms.
The potential failure mechanisms are considered individually, and their assessment using appropriate models enables design and qualification of the product for the intended application. The steps in conducting an FMMEA are illustrated in Figure 3. The individual steps are described in greater detail in the following sections.

3.1.1 System definition, elements and functions

The FMMEA process begins by defining the system to be analyzed. A system is a composite of subsystems or levels that are integrated to achieve a specific objective. The system is divided into various subsystems or levels, and the division continues to the lowest possible level, which is a "component" or "element". Based on the convenience or needs of the team conducting the analysis, the system breakdown can be by function (i.e., according to what the system elements "do"), by location (i.e., according to where the system elements "are"), or both (i.e., functional within the location based breakdown, or vice versa). For example, in an automobile system, a functional breakdown would involve the cooling system, braking system, and propulsion system. A location breakdown would involve the engine compartment, passenger compartment and dashboard or control panel. In a printed circuit board system, a location breakdown would include the package, plated through hole (PTH), metallization, and the board itself.

For each component or element, all the associated functions are listed. For example, the primary function of a solder joint is to interconnect two materials. Hence, the failure of a solder joint will relate to its inability to perform as a physical and electrical interconnect.
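The system breakdown described above can be captured in a small tree structure. The sketch below is illustrative: the class and function names are hypothetical, and placing the solder joint alongside the printed circuit board elements is an assumption made so that the example has one element with listed functions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    """One node in the system breakdown (system, subsystem, or lowest-level element)."""
    name: str
    functions: List[str] = field(default_factory=list)
    children: List["Element"] = field(default_factory=list)


def lowest_level_elements(node: Element) -> List[Element]:
    """Return the leaves of the breakdown, i.e. the elements whose functions are listed."""
    if not node.children:
        return [node]
    leaves: List[Element] = []
    for child in node.children:
        leaves.extend(lowest_level_elements(child))
    return leaves


# Location-based breakdown of a printed circuit board system, following the text.
board_system = Element(
    "printed circuit board system",
    children=[
        Element("package"),
        Element("plated through hole (PTH)"),
        Element("metallization"),
        Element("board"),
        # Added for illustration; its functions are the ones named in the text.
        Element("solder joint", functions=["physical interconnect", "electrical interconnect"]),
    ],
)

for element in lowest_level_elements(board_system):
    print(element.name, "->", element.functions or "functions not yet listed")
```

A functional breakdown would use the same structure with subsystems named after what they do rather than where they are.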
Figure 3: FMMEA methodology
(Flowchart boxes: Define system and identify elements and their functions to be analyzed; Identify potential failure modes; Identify potential failure causes; Identify potential failure mechanisms; Identify failure models; Identify life cycle environmental and operating conditions; Prioritize failure mechanisms; Document the process.)

3.1.2 Potential failure modes

A failure mode is defined as the way in which a component, subsystem, or system could fail to meet or deliver the intended function [10]. For example, in a solder joint the potential failure modes are an open circuit or an intermittent change in resistance, either of which can hamper its functioning as an interconnect. A potential failure mode may be the cause of a potential failure mode in a higher level subsystem or system, or be the effect of one in a lower level component. For each element that has been identified, all possible failure modes are listed. In cases where information on possible failure modes is not available, potential failure modes may be identified using numerical stress analysis, accelerated tests to failure (e.g., HALT), past experience and engineering judgment [12].

3.1.3 Potential failure causes

A failure cause is defined as the circumstances during design, manufacture, or use that lead to a failure mode [12]. For each failure mode, all possible ways in which the failure can result are listed. Failure causes are identified by finding the basic reason that may lead to a failure during design, manufacturing, storage, transportation or use. Knowledge of potential failure causes can help identify the failure mechanisms driving the failure modes for a given element. For example, in an automotive underhood environment the solder joint failure modes of open circuit and intermittent change in resistance can potentially be caused by temperature cycling, random vibration and shock impact.
3.1.4 Potential failure mechanisms

Failure mechanisms are the processes by which specific combinations of physical, electrical, chemical and mechanical stresses induce failure [1]. Failure mechanisms are determined based on the combination of potential failure mode and cause of failure [5], by selecting the appropriate available mechanisms corresponding to that failure mode and cause. Studies on electronic material failure mechanisms, and on the application of physics based damage models to the design of reliable electronic products, covering all relevant wearout and overstress failures in electronics, are available in the literature [2] [3].

Failure mechanisms thus identified are categorized as either overstress or wearout mechanisms. Overstress failures arise as a result of a single load (stress) condition. Wearout failures, on the other hand, arise as a result of cumulative load (stress) conditions [12]. For example, in the case of a solder joint, the potential failure mechanisms driving the opens and shorts caused by temperature, vibration and shock impact are fatigue and overstress shock. Further analysis of the failure mechanisms depends on the type of mechanism.

3.1.5 Failure models

Failure models use stress and damage analysis to evaluate susceptibility to failure. Failure susceptibility is evaluated by assessing the time-to-failure or likelihood of a failure for a given geometry, material construction, and environmental and operational condition. For example, in the case of solder joint fatigue, the Dasgupta [22] and Coffin-Manson [20] failure models are used for stress and damage analysis for temperature cycling.

Failure models of overstress mechanisms use stress analysis to estimate the likelihood of a failure based on a single exposure to a defined stress condition. The simplest formulation for an overstress model is the comparison of an induced stress versus the strength of the material that must sustain that stress.
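The two kinds of model mentioned so far can be sketched as follows. The overstress check is the stress-versus-strength comparison described above. The fatigue function uses the textbook Coffin-Manson relation, Δε_p/2 = ε_f'(2N_f)^c, which is an assumption about the form intended here; the function names and all numerical values are hypothetical and are not taken from the case study.

```python
def overstress_failure(induced_stress, material_strength):
    """Simplest overstress model: failure is predicted if the stress reaches the strength."""
    return induced_stress >= material_strength


def coffin_manson_cycles(plastic_strain_range, fatigue_ductility_coeff, fatigue_ductility_exp):
    """Cycles to failure N_f from the Coffin-Manson relation (textbook form, assumed here).

    plastic_strain_range / 2 = fatigue_ductility_coeff * (2 * N_f) ** fatigue_ductility_exp
    The exponent is negative, so a larger strain range gives fewer cycles to failure.
    """
    ratio = plastic_strain_range / (2.0 * fatigue_ductility_coeff)
    return 0.5 * ratio ** (1.0 / fatigue_ductility_exp)


# Hypothetical values for illustration only (stress and strength in the same units).
print(overstress_failure(induced_stress=35.0, material_strength=50.0))
print(coffin_manson_cycles(plastic_strain_range=0.02,
                           fatigue_ductility_coeff=0.325,
                           fatigue_ductility_exp=-0.5))
```

The overstress function returns a yes or no answer for a single exposure, while the fatigue function returns a life in cycles.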
Wearout mechanisms are analyzed using both stress and damage analysis to calculate the time required to induce failure based on a defined stress condition. In the case of wearout failures, damage is accumulated over a period until the item is no longer able to withstand the applied load. Therefore, an appropriate method for combining multiple conditions must be determined for assessing the time to failure. Sometimes, the damage due to the individual loading conditions may be analyzed separately, and the failure assessment results may be combined in a cumulative manner [13].

Failure models may be limited by the availability and accuracy of models for quantifying the time to failure of the system. They may also be limited by the ability to combine the results of multiple failure models for a single failure site and the ability to combine results of the same model for multiple stress conditions [12]. If no failure models are available, the appropriate parameter(s) to monitor can be selected based on an empirical model developed from prior field failure data or on models derived from accelerated testing.

3.1.6 Life cycle environment and operating conditions

Life cycle loads include environmental conditions such as temperature, humidity, pressure, vibration or shock, chemical environments, radiation, and contaminants, and loads due to operating conditions, such as current, voltage, and power [1]. The life cycle environment of a product consists of the assembly, storage, handling, and usage conditions of the product, including the severity and duration of these conditions. Information on life cycle conditions can be used for eliminating failure modes that may not occur under the given application conditions.

In the absence of field data, information on the product usage conditions can be obtained from environmental handbooks or data monitored in similar environments. Ideally, such data should be obtained and processed during actual application.
Recorded data from the life cycle stages for the same or similar products can serve as input to the FMMEA process. Some organizations collect, record, and publish data in the form of handbooks that provide guidelines for designers and engineers developing products for market sectors of their interest. Such handbooks can provide first approximations for the environmental conditions that a product is expected to undergo during operation. These handbooks typically provide aggregate values of environmental variables and do not cover all the life cycle conditions. For example, for automotive applications, the life cycle environment and operating conditions can be obtained from the SAE handbook [23], but the application conditions even in the SAE handbook are limited.

3.1.7 Failure mechanism prioritization

Ideally, all failure mechanisms and their interactions must be considered for product design and analysis. In the life cycle of a product, several failure mechanisms may be activated by different environmental and operational parameters acting at various stress levels, but in general only a few operational and environmental parameters and failure mechanisms are responsible for the majority of the failures. High priority mechanisms are those select failure mechanisms that determine the operational stresses and the environmental and operational parameters that must be accounted for in the design or be controlled. High priority failure mechanisms are identified through prioritization of all the potential failure mechanisms. The methodology for failure mechanism prioritization is shown in Figure 4.

Figure 4: Failure mechanism prioritization
(Flowchart elements: Potential failure mechanism list; First level prioritization; Evaluate failure susceptibility; Evaluate occurrence; Evaluate severity; Second level prioritization; outcomes: High risk, Medium risk, Low risk.)

Environmental and operating conditions are used for first level prioritization of all potential failure mechanisms.
If the stress levels generated by certain operational and environmental conditions are non-existent or negligible, the failure mechanisms that depend exclusively on those conditions are assigned a "low" risk level and are eliminated from further consideration.

For all the failure mechanisms remaining after the first level prioritization, the susceptibility to failure by those mechanisms is evaluated using the previously identified failure models, when such models are available. For the overstress mechanisms, failure susceptibility is evaluated by conducting a stress analysis to determine whether failure is precipitated under the given environmental and operating conditions. For the wearout mechanisms, failure susceptibility is evaluated by determining the time-to-failure under the given environmental and operating conditions. To determine the combined effect of all wearout failures, the overall time-to-failure is also evaluated with all wearout mechanisms acting simultaneously. In cases where no failure models are available, the evaluation is based on past experience, manufacturer data, or handbooks.

After evaluation of failure susceptibility, occurrence ratings under the environmental and operating conditions applicable to the system are assigned to the failure mechanisms. For the overstress failure mechanisms that precipitate failure, the highest occurrence rating, "frequent", is assigned. If no overstress failures are precipitated, the lowest occurrence rating, "extremely unlikely", is assigned. For the wearout failure mechanisms, the ratings are assigned by benchmarking the individual time-to-failure for a given wearout mechanism against the overall time-to-failure, the expected product life, past experience, and engineering judgment.

The occurrence ratings shown in Table 1 are defined below. A "frequent" occurrence rating involves failure mechanisms with a very low time-to-failure (TTF) and overstress failures that are almost inevitable in the use condition. A "reasonably probable" rating involves failure mechanisms with a low TTF. An "occasional" rating involves failures with a moderate TTF. A "remote" rating involves failure mechanisms that have a high TTF. An "extremely unlikely" rating is assigned to failures with a very high TTF or to overstress failure mechanisms that do not produce any failure.

To provide a qualitative measure of the failure effect, each failure mechanism is assigned a severity rating. The failure effect is assessed first at the level being analyzed, then at the next higher level, the subsystem level, and so on up to the system level. Safety issues and the impact of a failure mechanism on the end system are used as the primary criteria for assigning the severity ratings. In the severity rating, the worst possible consequence is assumed for the failure mechanism being analyzed. Past experience and engineering judgment may also be used in assigning severity ratings.

The severity ratings shown in Table 2 are defined below. A "very high or catastrophic" severity rating involves a failure mode that may cause loss of life or complete failure of the system. A "high" severity rating involves a failure mode that might cause a severe injury or a loss of function of the system. A "moderate or significant" rating involves failure modes that may cause minor injury or gradual degradation in performance over time through loss of availability. A "low or minor" rating involves a failure mode that may not cause any injury but may leave the system operating at reduced performance. A "very low or none" rating involves a failure mode that does not cause any injury and has no impact on the system or, at most, is a minor nuisance.

Second level prioritization involves prioritizing the failure mechanisms into three risk levels using the risk matrix shown in Table 3.
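The lookup in the risk matrix can be sketched in a few lines. This is only an illustrative sketch: the matrix entries are transcribed from Table 3, the shortened severity labels and the function names risk_level and high_priority are assumptions introduced here, and the three example mechanisms are hypothetical.

```python
# Occurrence ratings from Table 1, ordered from most to least likely.
OCCURRENCE = [
    "frequent",
    "reasonably probable",
    "occasional",
    "remote",
    "extremely unlikely",
]

# Risk matrix as read from Table 3: one row per severity rating,
# one entry per occurrence rating in the order of OCCURRENCE.
RISK_MATRIX = {
    "very high": ["high", "high", "high", "moderate", "moderate"],
    "high": ["high", "high", "moderate", "moderate", "low"],
    "moderate": ["high", "moderate", "moderate", "low", "low"],
    "low": ["high", "moderate", "low", "low", "low"],
    "very low": ["moderate", "moderate", "low", "low", "low"],
}


def risk_level(severity, occurrence):
    """Look up the risk level for one failure mechanism."""
    column = OCCURRENCE.index(occurrence.lower())
    return RISK_MATRIX[severity.lower()][column]


def high_priority(mechanisms):
    """Return the names of mechanisms whose risk level is 'high'.

    mechanisms: iterable of (name, severity, occurrence) tuples.
    """
    return [name for name, sev, occ in mechanisms
            if risk_level(sev, occ) == "high"]


# Hypothetical entries, for illustration only.
example = [
    ("mechanism A", "very high", "frequent"),
    ("mechanism B", "very high", "remote"),
    ("mechanism C", "very low", "occasional"),
]
print(high_priority(example))
```

A dictionary keyed by severity keeps the code in the same row and column layout as the table, which makes it easy to check entry by entry.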
In principle, all failure mechanisms with a "high risk" level are high priority mechanisms that need to be accounted for and controlled. Mechanisms with lower risk levels can also be classified as high priority, or further prioritization within a given risk level may be done, depending on the product type, use conditions, and the needs and objectives of the organization.

Table 1: Occurrence ratings
Rating | Criteria
Frequent | Overstress failure or very low TTF
Reasonably Probable | Low TTF
Occasional | Moderate TTF
Remote | High TTF
Extremely Unlikely | No overstress failure or very high TTF

Table 2: Severity ratings
Rating | Criteria
Very high or catastrophic | System failure or safety-related catastrophic failures
High | Loss of function
Moderate or significant | Gradual performance degradation
Low or minor | System operable at reduced performance
Very low or none | Minor nuisance

Table 3: Risk matrix (rows: severity; columns: occurrence)
Severity | Frequent | Reasonably Probable | Occasional | Remote | Extremely Unlikely
Very high or catastrophic | High | High | High | Moderate | Moderate
High | High | High | Moderate | Moderate | Low
Moderate or significant | High | Moderate | Moderate | Low | Low
Low or minor | High | Moderate | Low | Low | Low
Very low or none | Moderate | Moderate | Low | Low | Low

3.1.8 Documentation

The documentation of the FMMEA process facilitates data organization, distribution, and analysis. For products already developed and manufactured, root-cause analysis is conducted for the identified high priority mechanisms, and corrective actions are taken to mitigate the risk. Once the corrective actions are implemented, the failure prioritization may be conducted again to reassess the risks. This process continues until the risk level is acceptable. The history and lessons learned contained within the documentation provide a framework for future product introductions.

3.2 Case study

A printed circuit board (PCB) assembly used in an automotive application was used to demonstrate the FMMEA process.
The system was an FR-4 PCB with copper metallizations, plated through-holes (PTHs), and eight surface mount inductors soldered to the pads using 63Sn-37Pb solder. The ends of the inductors were connected to the PTHs through the PCB metallization. The PTHs were solder filled, and an event detector circuit was connected in series with all the inductors through the PTHs. The PCB assembly was mounted at all four corners in the engine compartment of a 1997 Toyota 4Runner. Mountings were not considered as failure locations. System failure was defined as one that would result in breakdown, or no current passage, in the event detector circuit. The system was broken down by location into six different elements: surface mount inductor, pad, PTH, PCB, metallization, and solder interconnect, as shown in Figure 5.

[Figure 5: Elements in the circuit card system. Labeled elements: PCB, inductor, metallization, PTH, interconnect, pad, mounting.]

For all the elements listed, the corresponding functions and the potential failure modes were identified. The function of all the elements was to maintain electrical continuity. For the PCB, an additional function was to provide mechanical support to the system. Table 4 shows the physical location of all possible failure modes for the listed elements. For example, for the solder joint the potential failure modes are open and intermittent change in resistance.

For the sake of simplicity and for demonstration purposes, it was assumed that the test setup, the board, and its components were defect free. Stresses induced on the board and its components from manufacturing, storage, handling, and transportation were also assumed to be negligible. Potential failure causes were identified for the failure modes, and the listing is shown in Table 4. For example, for the solder joint the potential failure causes for open and intermittent change in resistance are temperature cycling, random vibration, or sudden shock impact caused by vehicle collision.
Based on the potential failure causes that were assigned to the failure modes, the corresponding potential failure mechanisms were identified. Table 4 lists the failure mechanisms for the failure causes that were identified. For example, for the open and intermittent change in resistance in the solder joint, the mechanisms driving the failure were solder joint fatigue and shock.

For each of the failure mechanisms listed, the appropriate failure model was identified from the literature. Information about product dimensions and geometry was obtained from the design specification, board layout drawing, and component manufacturer data sheets. Table 4 shows all the failure models for the failure mechanisms that were listed. For example, for solder joint fatigue, the Coffin-Manson [20] failure model was used for stress and damage analysis under temperature cycling.

The assembly was powered by a three volt battery source independent of the automobile electrical system. There were no high current, voltage, magnetic, or radiation sources in the area. For the temperature, vibration, and humidity conditions prevalent in the automotive underhood environment, data was obtained from the Society of Automotive Engineers (SAE) environmental handbook [23], as no manufacturer field data were available for the automotive underhood environment in the Washington, DC area. The maximum temperature in the automotive underhood environment was 121°C [23]. The car was assumed to operate on average three hours per day, in two equal trips, in the Washington, DC area. Random vibration effects were assumed, and the maximum shock level was assumed to be 45G for 3ms. The maximum relative humidity in the underhood environment was 98% at 38°C [23]. The average daily maximum and minimum temperatures in the Washington, DC area [24] for the period in which the study was conducted were 27°C and 16°C, respectively.
After all potential failure modes, causes, mechanisms, and models were identified for each element, the first level prioritization was made based on the life cycle environmental and operating conditions. In the automotive underhood environment for the given test setup, failures driven by electrical overstress (EOS) and electrostatic discharge (ESD) were ruled out because of the absence of active devices, the low voltage of the battery source, and the relatively large thickness of the PCB. Electromagnetic interference (EMI) was also not expected because, besides the low voltage and current from the batteries used to power the test setup, there were no high current, voltage, or magnetic sources in the test area. Hence EOS, ESD, and EMI were assigned a "low" risk level.

The time to failure for the wearout failure mechanisms was calculated using calcePWA (see footnote 1). Occurrence ratings were assigned by benchmarking the time-to-failure for a given wearout mechanism against the overall time-to-failure with all wearout mechanisms acting together. For the inductors there was no failure model available, and the occurrence rating was assigned based on failure rate data for inductors obtained from the Telcordia handbook [25]. Since no model was available for wearout associated with the pads, it was arbitrarily assigned a "remote" occurrence rating.

An assessment of shock as an overstress mechanism, with a shock level of 45G for 3ms, using calcePWA produced no failure for the interconnects and the board; hence it was assigned an "extremely unlikely" occurrence rating. Since no overstress shock failure was expected on the board and the interconnects, it was assumed there would also be no failure on the pads. Hence overstress shock failure on the pads (for which no model was available) was also assigned an "extremely unlikely" rating. The glass transition temperature for the board was 150°C.
Since the maximum temperature in the underhood environment was only 121°C [23], no glass transition was expected to occur, and it was assigned an "extremely unlikely" rating.

A short or open PTH would not have had any impact on the functioning of the circuit, as the PTHs were used only as terminations for the inductors. Hence, it was assigned a "very low" severity rating. For all other elements, any given failure mode of the element would have disrupted the functioning of the circuit. Hence, all other elements were assigned a "very high" severity rating.

Footnote 1: calcePWA is a physics-of-failure based virtual reliability assessment tool developed by the CALCE Electronic Products and Systems Center.

Second level prioritization and risk assessment for the failure mechanisms are shown in Table 4. The entire process was documented in a single worksheet, as shown in Table 4. Of all the failure mechanisms that were analyzed, fatigue due to thermal cycling and fatigue due to vibration at the solder joint interconnect were the only failure mechanisms that had a high risk. Being high risk failure mechanisms, they were identified as high priority.

An FMEA on the assembly would have identified all the elements, their functions, potential failure modes, and failure causes, as in FMMEA. FMEA would then have identified the effect of failure. For example, in the case of a solder joint interconnect, the failure effect of the open joint would have been no current passage in the test setup. Next, the FMEA would have identified the severity, occurrence, and detection probabilities associated with each failure mode. For example, for a solder joint open failure mode, each of the metrics (severity, occurrence, and detection) would have received a rating on a scale of ten based on past experience and engineering judgment. The product of severity, occurrence, and detection would then have been used to calculate the risk priority number (RPN).
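The RPN calculation described above can be written out as a short sketch. The ratings below are hypothetical values chosen only for illustration, the function name rpn is an assumption, and a rating range of 1 to 10 is assumed as the interpretation of "a scale of ten".

```python
def rpn(severity, occurrence, detection):
    """Risk priority number: product of the three FMEA ratings.

    Each rating is assumed to lie on a 1 to 10 scale.
    """
    for rating in (severity, occurrence, detection):
        if not 1 <= rating <= 10:
            raise ValueError("ratings are assumed to lie between 1 and 10")
    return severity * occurrence * detection


# Hypothetical (severity, occurrence, detection) ratings per failure mode.
failure_modes = {
    "solder joint open": (9, 6, 4),
    "PTH open": (2, 3, 5),
}

# FMEA ranks the failure modes by descending RPN.
ranked = sorted(failure_modes,
                key=lambda mode: rpn(*failure_modes[mode]),
                reverse=True)
print(ranked)
```

Because the three ratings are judgment-based integers, quite different combinations can produce the same product, which is one reason the text treats the resulting ranking with caution.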
The RPNs for the other failure modes would have been calculated in a similar manner, and all the failure modes would then have been prioritized based on their RPN values. This is unlike FMMEA, which used failure mechanisms and models, and the combined effect of all failure mechanisms, to quantitatively evaluate the occurrence and then, in conjunction with severity, assigned a risk level to each failure mechanism for prioritization.

3.3 Benefits

FMMEA allows the design team to take into account the available scientific knowledge of failure mechanisms and merge it with the systematic features of the FMEA template, in keeping with the "design for reliability" philosophy. The part of the FMEA that is incorporated in the FMMEA makes the identification process systematic, so that all the elements are considered and nothing is overlooked. The idea of prioritization embedded in the FMEA process is also used in FMMEA to identify the mechanisms that are likely to cause failures during the product life cycle.

FMMEA differs from FMEA in a few respects. In FMEA, potential failure modes are examined individually, and the combined effects of coexisting failure causes are not considered. FMMEA, on the other hand, considers the impact of failure mechanisms acting simultaneously. FMEA involves precipitation and detection of failure for updating and calculating the RPN, and cannot be applied in cases that involve continuous monitoring of performance degradation over time. FMMEA, in contrast, does not require the failure to be precipitated and detected, and the uncertainties associated with the detection estimate are not present. Environmental and operating conditions are not used at a quantitative level in FMEA. At best they are used to eliminate certain failure modes.
FMMEA prioritizes the failure mechanisms using information on the stress levels of environmental and operating conditions to identify high priority mechanisms that must be accounted for in the design or be controlled. This prioritization overcomes the shortcomings of the RPN prioritization used in FMEA, which provides a false sense of granularity. Thus FMMEA provides more quantitative information regarding product reliability and opportunities for improvement than FMEA, as it takes specific failure mechanisms and the stress levels of environmental and operating conditions into account in the analysis process.

There are several benefits to organizations that use FMMEA. It provides specific information on stress conditions so that the acceptance and qualification tests yield usable results. Use of the failure models at the development stage of a product also allows for appropriate "what-if" analysis of proposed technology upgrades. FMMEA can also be used to aid several design and development steps considered to be best practices, which can only be performed or enhanced by using knowledge of failure mechanisms and models. These include virtual qualification, accelerated testing, root cause analysis, life consumption monitoring, and prognostics. All the technological and economic benefits provided by these practices are realized better through the adoption of FMMEA.

FMMEA enhances the value of FMEA by identifying and evaluating the relevant failure mechanisms and models using the stress levels of environmental and operating conditions, and it provides a high return on investment by providing knowledge about the possible failures and their causes in a quantifiable manner. While FMEA and FMECA are often implemented as a standard requirement or contractual obligation, FMMEA makes the process useful by incorporating scientific knowledge of the failure mechanisms and models.
Table 4: FMMEA worksheet for the case study
Element | Potential failure mode | Potential failure cause | Potential failure mechanism | Mechanism type | Failure model | Failure susceptibility | Occurrence | Severity | Risk
PTH | Electrical open in PTH | Temperature cycling | Fatigue | Wearout | CALCE PTH barrel thermal fatigue [16] | > 10 years | Remote | Very low | Low
Metallization | Electrical short/open, change in resistance in the metallization traces | High temperature | Electromigration | Wearout | Black [18] | > 10 years | Remote | Very high | Moderate
Metallization | Electrical short/open, change in resistance in the metallization traces | High relative humidity; ionic contamination | Corrosion | Wearout | Howard [19] | > 10 years | Remote | Very high | Moderate
Component (inductors) | Short/open between windings and the core | High temperature | Wearout of winding insulation | Wearout | No model | - | Remote* | Very high | Moderate
Interconnect | Open/intermittent change in electrical resistance | Temperature cycling | Fatigue | Wearout | Coffin-Manson [20] | 170 days | Frequent | Very high | High
Interconnect | Open/intermittent change in electrical resistance | Random vibration | Fatigue | Wearout | Steinberg [21] | 43 days | Frequent | Very high | High
Interconnect | Open/intermittent change in electrical resistance | Sudden impact | Shock | Overstress | Steinberg [21] | No failure | Extremely unlikely | Very high | Moderate
PCB | Electrical short between PTHs | High relative humidity | CFF | Wearout | Rudra and Pecht [17] | 4.6 years | Occasional | Very low | Low
PCB | Crack/fracture | Random vibration | Fatigue | Wearout | Basquin [21] | > 10 years | Remote | Very high | Moderate
PCB | Crack/fracture | Sudden impact | Shock | Overstress | Steinberg [21] | No failure | Extremely unlikely | Very high | Moderate
PCB | Loss of polymer strength | High temperature | Glass transition | Overstress | No model | No failure | Extremely unlikely | Very high | Moderate
PCB | Open | Discharge of high voltage through dielectric material | EOS/ESD | Overstress | No model | Eliminated in first level prioritization | - | - | Low
PCB | Excessive noise | Proximity to high current or magnetic source | EMI | Overstress | No model | Eliminated in first level prioritization | - | - | Low
Pad | Lift/crack | Temperature cycling / random vibration | Fatigue | Wearout | No model | - | Remote | Very high | Moderate
Pad | Lift/crack | Sudden impact | Shock | Overstress | No model | - | Extremely unlikely | Very high | Moderate
* Based on failure rate data of inductors in Telcordia [25]

Chapter 4: PROGNOSTICS AND REMAINING LIFE ESTIMATION

Remaining life estimation provides an estimate of the remaining life of a product based on its accumulated damage. It involves choosing the right amount of historical data for prediction, so as to maximize the system's adaptability to rapid changes in degradation while maintaining an acceptable amount of predictive variability and uncertainty. Based on the remaining life information, the user can decide whether to keep the product in operation with continuous monitoring or to schedule a maintenance or replacement action.

The first step in finding the right historical data is to use regression analysis with the largest time window for data acquisition, which will give the best estimate of the regression fit [33]. This prediction is tested to see whether it is reasonably compatible with the most recent data points. If so, the regression is used. If not, the size of the regression window is reduced and the analysis is repeated. This recursive regression continues until it yields a window small enough to be compatible with the most recent data points. As a result, this method can detect whether the most recent data points indicate a change from the long-term regression.

Remaining life estimates are difficult to formulate, as their accuracy is subject to stochastic processes. Because of this uncertainty, prognostic methods must consider the interrelationships between accuracy, precision, and confidence [32]. This leads to a paradox: the more precise the remaining life estimate, the less probable it is that the estimate will be correct.
Finding where an extrapolated trend meets a condemnation threshold may provide an expectation of remaining life, but it does not provide sufficient information to make a decision.

The total useful life of a product at any point on the remaining life vs. time plot can be estimated by adding the time in use (the x-coordinate) and the remaining life (the y-coordinate) at that point. However, this analysis is a one-point estimate and does not take the product usage trend into account. The usage trend can be taken into account by extrapolating a trend line through the available remaining life data points. The intersection of the trend line with the time axis gives the total useful life of the product, as shown in Figure 6.

Statistical methods for remaining life prediction include multivariate regression, Bayesian regression methods, time-series analysis, and discrimination or clustering analysis. Analysis may focus on single or multiple parameters. For single parameter remaining life prediction, the regression model is applied as the data is collected to determine the trends. The trend is compared in real time to a metric failure limit that is established offline. The point of predicted failure is calculated as the intersection of these two lines. If an unexpected event occurs that dramatically increases degradation, it is identified and addressed. The remaining life estimate is updated at the end of a pre-selected time period to take into account any sudden change in the life cycle environment or usage of the product.

The amount of data included in the analysis affects the prediction. Use of large amounts of data spanning a long window of data acquisition tends to yield more stable, less variable predictions. However, it may also yield a prediction that is less sensitive to recent changes, as shown in Figure 6.
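The trend-line estimate of total useful life, and its dependence on the data window, can be illustrated with a minimal sketch. It assumes an ordinary least-squares line through the remaining life points; the function names and the data are hypothetical and are not taken from the case studies.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def total_useful_life(days_in_use, remaining_life, window=None):
    """Time at which the trend line crosses the time axis.

    window: number of most recent points to fit (None means all points).
    """
    if window is not None:
        days_in_use = days_in_use[-window:]
        remaining_life = remaining_life[-window:]
    slope, intercept = fit_line(days_in_use, remaining_life)
    if slope >= 0:
        raise ValueError("trend is not decreasing; no crossing ahead")
    return -intercept / slope


# Hypothetical data: degradation speeds up after day 10.
days = [0, 5, 10, 15, 20]
remaining = [40, 35, 30, 22, 14]

print(total_useful_life(days, remaining))            # all points
print(total_useful_life(days, remaining, window=3))  # last three points
```

With these made-up numbers the three-point window gives a shorter total useful life than the full data set, because it follows the faster recent degradation, at the cost of resting on fewer points.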
Use of a smaller data set spanning the most recent operating history tends to produce predictions that are more variable but also more sensitive to current operating conditions. The goal in remaining life prediction is to choose among data windows of varying size so as to maximize the system's adaptability to change while maintaining an acceptable amount of predictive uncertainty.

[Figure 6: Choosing the right data window. Plot of remaining life (days) versus time in use (days) with three trend lines: based on all points (useful life: 40 days), based on four points (useful life: 29 days), and based on three points (useful life: 26 days).]

4.1.1 Leap-Frog Technique

One of the methods for remaining life estimation is the LEAP-Frog technique [33]. In this method the system is assumed to be in a steady state of health. The goal is to predict the future health without being sensitive to data that is correlated and noisy. The technique is also responsive to changes in system performance, so changes in the system health are detected and the predictions are adjusted accordingly.

[Figure 7: LEAP-Frog algorithm. Flowchart: select the full data set; perform regression analysis; find the expected prediction for the next period; is the expected prediction compatible with the most recent data points? If no, shorten the data set and repeat the regression; if yes, final prediction.]

The goal is to make a prediction at the current time of the value (and uncertainty intervals) of the system at a future time, given all past data and a relatively small set of models/time windows.
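The steps of the flowchart in Figure 7 can be sketched in code. This is a minimal sketch under stated assumptions, not the implementation in [33]: the compatibility test used here (a line fitted to the window, excluding the last n_recent points, must predict each of those points to within a fixed tolerance), the function names, the parameter defaults, and the data are all hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def select_window(days, remaining, n_recent=2, tolerance=2.0, min_points=3):
    """Return the start index of the longest compatible regression window.

    Assumed compatibility test: the fit on the window without its last
    n_recent points must predict those points to within `tolerance` days.
    """
    n = len(days)
    if n < n_recent + 2:
        raise ValueError("not enough data points")
    last_start = n - max(min_points, n_recent + 2)
    for start in range(last_start + 1):  # longest window first
        slope, intercept = fit_line(days[start:n - n_recent],
                                    remaining[start:n - n_recent])
        errors = [abs(slope * x + intercept - y)
                  for x, y in zip(days[-n_recent:], remaining[-n_recent:])]
        if max(errors) <= tolerance:
            return start
    return last_start  # fall back to the shortest allowed window


def predict_total_life(days, remaining, **kwargs):
    """Fit the selected window and return where the trend reaches zero."""
    start = select_window(days, remaining, **kwargs)
    slope, intercept = fit_line(days[start:], remaining[start:])
    if slope >= 0:
        raise ValueError("trend is not decreasing; no crossing ahead")
    return -intercept / slope


# Hypothetical data: steady degradation, then a sudden faster rate.
days = [0, 5, 10, 15, 20, 25, 30]
sudden = [50, 45, 40, 35, 24, 13, 2]
print(select_window(days, sudden), predict_total_life(days, sudden))
```

Searching from the longest window downward means the first window that passes the test is also the longest one, which matches the intent of keeping as much history as the recent data allows.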
The method begins with a regression analysis using a large time window for data acquisition, which will likely give the best estimate of the regression fit if the system health is changing at a constant rate (for example, steady with a slow rate of degradation). This prediction, together with an uncertainty distribution about the estimates, is tested to see whether it is reasonably compatible with the most recent data points. If so, the regression is used. If not, the size of the regression window is reduced and the analysis is repeated. This continues until the method yields a window small enough to be compatible with the most recent data points. A flowchart of the LEAP-Frog algorithm is shown in Figure 7. As a result, this method can detect whether the most recent data points indicate a change from the long-term regression (as would be the case if the system had a change in its rate of degradation). In the end, the method uses the longest regression window that does not produce evidence (based on the most recent records) refuting the assumption of a good linear fit, and then uses this window to predict the remaining life. Figure 8 and Figure 9 show three prognostic methods compared with the LEAP-Frog method.

[Figure 8: Comparison of remaining life estimation models - Gradual. Plot of remaining life (days) versus days in use, and histogram of average prediction error (days) for ALL (Reg), Last 5 (Reg), Last 3 (Reg), and Leap Frog.]

[Figure 9: Comparison of remaining life estimation models - Gradual & Sudden. Plot of remaining life versus days in use, and histogram of average prediction error (days) for ALL (Reg), Last 5 (Reg), Last 3 (Reg), and Leap Frog.]

The four methods used to predict future remaining life are:
1. the regression of the damage on all the data since the start of data collection
2. the regression of the damage on the last 3 data points
3. the regression of the damage on the last 5 data points
4. LEAP-Frog

The histograms show the average prediction errors for the regression methods under the two degradation patterns. The degradation patterns were obtained from the case studies conducted in previous life consumption monitoring studies [26] [27]. For both of the conditions shown, Figure 8 (steady degradation) and Figure 9 (slow steady degradation with a sudden change to fast degradation in between), the LEAP-Frog regression method has the lowest average prediction error. This illustrates the ability of the LEAP-Frog method to adapt to a rapidly changing situation.

4.1.2 Accuracy of Remaining Life Prediction

The rate at which the estimate is updated affects the accuracy of prediction. The algorithm or model used for estimation, and how it trends the data into a future estimate of remaining life, also affects the accuracy of prediction. Accuracy further depends on whether future planned operational usage is factored into the estimates. As more data is collected, the level of knowledge about the characteristics of the approaching failure can be refined. As the time of actual failure approaches, the remaining life prediction is likely to become more and more accurate.
[Figure 10: Accuracy of prediction. Frequency distribution of the predicted time of failure along the time axis relative to the actual time to fail; area A (prognostic inefficiency) and area B (prognostic failure).]

Predictions in area (A) precede the actual failure and lead to prognostic inefficiency: equipment is removed while it still has remaining useful life, which carries a financial cost. Predictions in area (B), on the other hand, follow the actual failure and lead to prognostic failure, with inadequate operational performance as the consequence. Although some measure of remaining life estimation is possible using historical data, the best estimates are achieved when the future planned operational usage can be factored into these estimates. While the current remaining life estimation algorithms use fairly simple metrics and features to measure and characterize the changes in the sensor data, an alternative solution is to use neural nets coupled with appropriate feature extractors.

A prediction can be for a short time horizon, or it can be an estimate of the time until the part needs to be replaced or a failure will occur. The spreading of the error bars associated with the remaining life prediction indicates whether the remaining life estimation models are good for short or longer time horizons. If the error bars spread rapidly, the predictions are reliable only for the shorter time horizon. If they are narrow and follow the true trajectory accurately, the information from the predictions is useful for longer time horizons.

Approaches to remaining life prediction can be broadly classified into three categories. The first are physical models that have been developed and validated with large data sets. The second are systems that use rules of thumb, and the third are statistical models that learn from historical data.
While the physical and the rule of thumb based models have the capability to anticipate fault events that are yet to occur, the learning systems based on statistical models are only as good as the data on which they have been trained. However, the learning systems can process a wide variety of data types and have an edge over the other methods because they exploit the nuances in the data. This is particularly true for new sources of data for which expert analysis, physical models, and rules have not yet been developed.

In practice, failure indications become more pronounced and easier to interpret as remaining life decreases. In general, the true remaining life probability density functions (PDFs) should become narrower (less uncertain) and more stable as the damage condition progresses towards failure [34], as shown in Figure 11.

[Figure 11: Error and uncertainty in predictions. Failure distributions plotted against time as time and additional data increase, showing the current time, the expected failure, and the actual failure. Prediction error = Actual - Expected failure. Uncertainty = Spread of failure distribution.]

Allowing higher order models provides a closer fit, but under extrapolation even a small error in the coefficients may be magnified a great deal. Using lower order models thus helps prevent minor coefficient estimation errors from exaggerating the extrapolated values.

Chapter 5: CONCLUSIONS

The improved LCM process extends and generalizes the LCM approach to the system level and has several improvements over the earlier versions by Ramakrishnan and Mishra [26] [27]. System level life consumption monitoring requires systematically identifying all the parameters that drive failure in a system and monitoring the select few parameters that dominate it.
FMMEA is introduced as a new step in the LCM process that systematically identifies all failure mechanisms and models for all potential failures modes, and prioritizes the failure mechanisms to identify the high priority mechanisms. Ideally all failure mechanisms and their interactions must be considered for product design and analysis. In the life cycle of a product, several failure mechanisms may be activated by different environmental and operational parameters acting at various stress levels, but only a few operational and environmental parameters and failure mechanisms are in general responsible for the majority of the failures. High priority mechanisms provide effective utilization of resources and are those select failure mechanisms that determine the operational stresses and the environmental and operational parameters that must be accounted for in the design or be controlled. This enables the right suite of product parameters that need to be monitored for determining the damage and life consumed. The traditional remaining life estimation algorithm used in earlier approaches in life consumption monitoring process were overtly simplistic and took into account only the previous data point during iteration. A new algorithm called Leap Frog technique for remaining life prediction has been incorporated that takes care of the past data and is 41 sensitive to changes in health of system making it easier to quickly adapt to the rapid changes in degradation. The superiority of the Leap Frog method is validated by comparing it three other with three other prognostic methods using data from previous life consumption monitoring case study. Discussion on accuracy of remaining life predictions and the uncertainty issues associated with the predictions have been introduced. 42 Chapter 6: REFERENCES [1] Hu, J., Barker, D., Dasgupta, A., and Arora, A., ?Role of Failure-mechanism Identification in Accelerated Testing,? Journal of the IES, Vol. 36, No. 4, pp. 
39-45, July 1993.
[2] Dasgupta, A. and Pecht, M., "Material Failure Mechanisms and Damage Models," IEEE Transactions on Reliability, Vol. 40, No. 5, pp. 531-536, December 1991.
[3] JEDEC Publication JEP 122-B, "Failure Mechanisms and Models for Semiconductor Devices," August 2003.
[4] JEDEC Publication JEP 131, "Process Failure Modes and Effects Analysis (FMEA)," February 1998.
[5] JEDEC Publication JEP 148, "Reliability Qualification of Semiconductor Devices Based on Physics of Failure Risk and Opportunity Assessment," April 2004.
[6] Bowles, J.B. and Bonnell, R.D., "Failure Modes, Effects and Criticality Analysis - What Is It and How To Use It," Tutorial Notes Annual Reliability and Maintainability Symposium, 1998.
[7] Bowles, J.B., "Fundamentals of Failure Modes and Effects Analysis," Tutorial Notes Annual Reliability and Maintainability Symposium, 2003.
[8] Kara-Zaitri, C., Keller, A.Z., Fleming, P.V., "A Smart Failure Mode and Effect Analysis Package," Annual Reliability and Maintainability Symposium Proceedings, pp. 414-421, 1992.
[9] Signor, M.C., "The Failure-Analysis Matrix: a Kinder, Gentler Alternative to FMEA for Information Systems," Annual Reliability and Maintainability Symposium Proceedings, pp. 173-177, January 2002.
[10] SAE Standard SAE J1739, "Potential Failure Mode and Effects Analysis in Design (Design FMEA) and Potential Failure Mode and Effects Analysis in Manufacturing and Assembly Processes (Process FMEA) and Effects Analysis for Machinery (Machinery FMEA)," August 2002.
[11] Bowles, J.B., "An Assessment of RPN Prioritization in a Failure Modes Effects and Criticality Analysis," Proceedings of Annual Reliability and Maintainability Symposium 2003, pp. 380-386, January 2003.
[12] IEEE Standard 1413.1-2002, IEEE Guide for Selecting and Using Reliability Predictions Based on IEEE 1413, 2003.
[13] "Guidelines for Failure Mode and Effects Analysis for Automotive, Aerospace and General Manufacturing Industries,"
Dyadem Press, Ontario, Canada, 2003.
[14] Failure Modes and Effects Analysis (FMEA): "A Guide for Continuous Improvement for the Semiconductor Equipment Industry," Technology Transfer #92020963B-ENG, SEMATECH, 1992.
[15] Franceschini, F., Galetto, M., "A New Approach for Evaluation of Risk Priorities of Failure Modes in FMEA," International Journal of Production Research, Vol. 39, No. 13, pp. 2991-3002, 2001.
[16] Bhandarkar, S.M., et al., "Influence of Selected Design Variables on Thermomechanical Stress Distributions in Plated Through Hole Structures," Transaction of the ASME - Journal of Electronic Packaging, Vol. 114, pp. 8-13, March 1992.
[17] Rudra, A.B., et al., "Electrochemical Migration in Multichip Modules," Circuit World, Vol. 22, No. 1, pp. 67-70, 1995.
[18] Black, J.R., "Physics of Electromigration," IEEE Proceedings of International Reliability Physics Symposium, pp. 142-149, 1983.
[19] Howard, R.T., "Electrochemical Model for Corrosion of Conductors on Ceramic Substrates," IEEE Transactions on CHMT, Vol. 4, No. 4, pp. 520-525, December 1981.
[20] Foucher, B., Boullie, J., Meslet, B., Das, D., "A Review of Reliability Predictions Methods for Electronic Devices," Microelectronics Reliability, Vol. 42, No. 8, pp. 1155-1162, August 2002.
[21] Steinberg, D.S., "Vibration Analysis for Electronic Equipment," 2nd Edition, John Wiley & Sons, 1988.
[22] Dasgupta, A., Oyan, C., Barker, D. and Pecht, M., "Solder Creep-Fatigue Analysis by an Energy-Partitioning Approach," ASME Transactions, Journal of Electronic Packaging, Vol. 114, No. 2, pp. 152-160, 1992.
[23] Society of Automotive Engineers, Recommended Environmental Practices for Electronic Equipment Design, SAE J1211, Rev. Nov 1978.
[24] Monthly Temperature Averages for the Washington, DC Area, accessed August 17, 2003.
[25] Telcordia Technologies, Special Report SR-332: "Reliability Prediction Procedure for Electronic Equipment Issue 1," Telcordia Customer Service, Piscataway, N.J., May 2001.
[26] Ramakrishnan, A., "Health and Life Consumption Monitoring Using Sensor Technologies," Master's Thesis, Department of Mechanical Engineering, University of Maryland, College Park, 2001.
[27] Mishra, S., "Life Consumption Monitoring for Electronics," Master's Thesis, Department of Mechanical Engineering, University of Maryland, College Park, 2003.
[28] Mishra, S., Pecht, M., and Goodman, D., "In-situ Sensors for Product Reliability Monitoring," Proceedings of SPIE, Vol. 4755, pp. 10-19, 2002.
[29] Swantech, SWAN Technology, http://www.swantech.com/technology.html, accessed May 2004.
[30] PC Guide, "Self-Monitoring Analysis and Reporting Technology (SMART)," http://www.pcguide.com/ref/hdd/qual/featuresSMART-c.html, accessed May 2004.
[31] General Motors, "GM Oil Life Monitoring Systems," http://www.gm.com/company/gmabity/environent/news_issues/news/oillifemonitor 041603.html, accessed June 2004.
[32] Ramakrishnan, A. and Pecht, M., "Implementing a Life Consumption Monitoring Process for Electronic Product," IEEE Components Packaging and Manufacturing Technology, Vol. 26, Issue 3, pp. 625-634, 2003.
[33] Greitzer, F.L. and Ferryman, T.A., "Predicting Remaining Life of Mechanical Systems," Intelligent Ship Symposium IV, April 2-3, 2001.
[34] Engel, S.J., et al., "Prognostics, The Real Issues Involved with Predicting Life Remaining," IEEE Aerospace Conference Proceedings, Vol. 6, pp. 457-469, 2000.