ABSTRACT

Title of Dissertation: QUANTILE-BASED LSTM REMAINING USEFUL LIFE PREDICTOR
Yonatan Saadon, Doctor of Philosophy, 2021
Dissertation directed by: Professor Patrick McCluskey, Department of Mechanical Engineering

Accurate prediction of the remaining useful life (RUL) of a degrading component is crucial to prognostics and health management for electronic systems, in order to monitor conditions and avoid reaching failure while minimizing downtime. However, the shortage of sufficiently large run-to-failure datasets is a serious bottleneck impeding the performance of data-driven approaches, and in particular, those involving neural network architectures. This work presents a new data-driven prognostic method that predicts the RUL using an ensemble of quantile-based Long Short-Term Memory (LSTM) neural networks, which reduces the RUL prediction task to a set of simpler binary classification problems that are amenable to prediction with LSTMs, even with limited data. The methodology was tested on two run-to-failure datasets, power MOSFETs and a filtration system, and showed promising results on both, demonstrating that this approach obtains improved RUL estimation accuracy for both the power MOSFETs and the filtration system, especially with a small training dataset that is characterized by a wide range of RUL values.

QUANTILE-BASED LSTM REMAINING USEFUL LIFE PREDICTOR
by Yonatan Saadon

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2021

Advisory Committee:
Professor Patrick McCluskey, Chair
Professor Hugh Bruck
Professor Mark Fuge
Professor Peter Sandborn
Professor Mohamad Al-Sheikhly

© Copyright by Yonatan Saadon 2021

Dedication

I would like to dedicate this work to my wife Noam, for her guidance and help over the past five years, to my dog and best friend Gucci, our cat Roger, and to my loving family: my father Yair, my stepmom Irit, and my amazing brother and sisters, Inbal, Liel, and Ohad.

Acknowledgments

I would like to thank my advisor, Professor Patrick McCluskey, for his support and mentorship throughout my graduate studies. I would also like to thank my committee members: Professor Hugh Bruck, for welcoming me into UMD as the director of graduate studies and for his mentoring and support in my first year; Professor Mark Fuge, for inspiring me to investigate machine learning after I took his class in my first semester; Professor Peter Sandborn, for replying to my email long before I joined UMD and for giving me a tour of CALCE; and Professor Mohamad Al-Sheikhly, for teaching me everything I know about polymer physics, and for being there every time I needed friendly advice. I would also like to thank my lab colleagues: Maxim Serebreni, Subramani Manoharan, Jennifa Li, Zhaoxi Yao, He Yun, Sriram Jayanthi, and Kunal Ahuja.

Table of Contents

Dedication ....................................................................................................................................... ii
Acknowledgments.......................................................................................................................... iii
List of Figures .............................................................................................................................. vii
List of Tables ................................................................................................................................
ix Introduction .................................................................................................................................... 1 Maintenance................................................................................................................................ 3 Prognostics .................................................................................................................................. 3 Experience based .................................................................................................................... 4 Analytical model based .......................................................................................................... 5 Data driven .............................................................................................................................. 6 Literature review ............................................................................................................................. 8 Liang et al. bearings RUL prediction using LSTM RNN ..................................................... 10 Karkulali et al. MOSFET prognostics using feed-forward neural network ......................... 13 Estimating RUL of Lithium-Ion batteries using RNN LSTM ................................................... 14 Power MOSFET RUL prediction comparison between two model based and one data driven methods ...................................................................................................................................... 16 iv A dual-LSTM framework combining change point detection and remaining useful life prediction ................................................................................................................................... 18 Remaining Useful Life Prediction for Rolling Bearings Using EMD-RISI-LSTM .................. 19 Our approach................................................................................................................................ 21 Deep learning ............................................................................................................................... 23 Artificial neural network (ANN) .................................................................................................. 24 Backpropagation .......................................................................................................................... 26 Recurrent neural network (RNN) ............................................................................................... 26 Long Short-Term Memory (LSTM) machines ............................................................................. 28 Project one: Quantile-based LSTM Remaining Useful Life prediction of MOSFETs. ................ 30 Background ................................................................................................................................ 30 Power modules ...................................................................................................................... 30 Power MOSFET.................................................................................................................... 31 Dataset one ................................................................................................................................ 33 Procedure ................................................................................................................................... 
34 Data representation ................................................................................................................. 34 Training process ..................................................................................................................... 38 RUL prediction ....................................................................................................................... 42 Prediction performances ............................................................................................................ 44 Comparison to previous work.................................................................................................... 47 v Project one code overview ......................................................................................................... 49 First function .......................................................................................................................... 49 Project 1 pseudocode ................................................................................................................. 52 Project two: Quantile-based LSTM Remaining Useful Life prediction of filtration system. ....... 56 Dataset 2 .................................................................................................................................... 56 Procedure ................................................................................................................................... 62 Data representation ................................................................................................................. 62 RUL prediction ....................................................................................................................... 66 Prediction performances ............................................................................................................ 67 Training and validation datasets ............................................................................................. 67 Test dataset ............................................................................................................................. 71 Conclusions ................................................................................................................................... 75 Contributions................................................................................................................................. 76 Future work ................................................................................................................................... 77 Citations ........................................................................................................................................ 78 vi List of Figures Figure 1 Wang et al process ............................................................................................................ 4 Figure 2 Flowchart of the proposed Dual-LSTM framework [46]. .............................................. 18 Figure 3 RMSE comparison [46] .................................................................................................. 19 Figure 4 Guo et al. proposed framework [47]............................................................................... 20 Figure 5 fully connected network ................................................................................................. 
24 Figure 6 Functions for ANN ......................................................................................................... 25 Figure 7 LSTM cell inner structure .............................................................................................. 29 Figure 8 LSTM network ............................................................................................................... 29 Figure 9 MOSFET structure ......................................................................................................... 31 Figure 10 Power MOSFET structure ............................................................................................ 32 Figure 11 complete accelerated aging system diagram [42] ......................................................... 34 Figure 12 Naive LSTM ................................................................................................................. 39 Figure 13 Classifier performance evaluation. Receiver operating characteristic (ROC) curves and the precision-recall curves showing the performances of the four classifiers. ............................. 41 Figure 14 Pipeline illustration. The module's measurements are given as input to the four LSTMs, each is trained to predict when a module has reached a certain quantile in its life (last half, third, quarter, and fifths). ...................................................................................................... 43 Figure 15 Validation performance. (a) Bar plots showing the error fraction of the four validation modules, when evaluating in different time points in the life of the module (half-life, last 1/4, 1/6, 1/8, 1/10, 1/12, and 1/14). ...................................................................................................... 45 vii Figure 16 test performance. (a) Bar plots showing the error fraction of the five test modules, when evaluating in different time points in the life of the module (half-life, last 1/4, 1/6, 1/8, 1/10, 1/12, and 1/14). .................................................................................................................... 47 Figure 17 comparison to previous work. RUL prediction performance assessment for module number 26, for GPR, EKF and PF as described for Celaya et al.[22], and for the quantile LSTM predictor. ....................................................................................................................................... 48 Figure 18 System of the experimental rig [67]. ............................................................................ 57 Figure 19 Filter under study .......................................................................................................... 59 Figure 20 Filters 1-8 from the training dataset ............................................................................. 68 Figure 21 Filters 9-16 from the training dataset ........................................................................... 68 Figure 22 Filters 17-24 from the training dataset ......................................................................... 69 Figure 23 Filters 1-8 from the validation dataset .......................................................................... 69 Figure 24 Training + Validation ................................................................................................... 70 Figure 25 Filters 1-8 from the test dataset .................................................................................... 
72 Figure 26 Filters 9-16 from the test dataset .................................................................................. 73 Figure 27 Test Dataset .................................................................................................................. 73 viii List of Tables Table 1 RNN and SOM RUL predictions ..................................................................................... 11 Table 2 RNN and SOM RUL prediction on a new dataset ........................................................... 12 Table 3 computational time vs accuracy ....................................................................................... 13 Table 4 cycles error with different methods ................................................................................. 15 Table 5 cycles error with different methods ................................................................................. 15 Table 6 RUL prediction with error in parentheses........................................................................ 17 Table 7 Evaluation and analysis of prediction results of four models .......................................... 21 Table 8 The features included for RUL prediction. ...................................................................... 35 Table 9 The MOSFET modules used throughout this study for training, validation, and testing, and the respective lifespan of these modules ................................................................................ 37 Table 10 RUL prediction results for GPR, EKF, PF, and the quantile-LSTM for the last 88 minutes in the lifetime of module number 23. RUL prediction error is between parentheses. .... 49 Table 11 Suspension profile details .............................................................................................. 59 Table 12 Particle size data ............................................................................................................ 60 Table 13 Training set ? 24 samples .............................................................................................. 61 Table 14 Validation set ? 8 samples ............................................................................................. 61 Table 15 Test set ? 16 samples ..................................................................................................... 62 Table 16 filters set assignments and lifespans .............................................................................. 65 Table 17 MAE and M per classifier for training and validation datasets combined .................... 71 Table 18 MAE and M per classifier for the test dataset ............................................................... 74 ix Introduction Prediction of the Remaining Useful Life (RUL) is crucial for mitigating system shutdown and failure. It can also decrease the costs of maintenance by indicating the live status of the system. Existing strategies to address the challenge of RUL prediction may be categorized into (1) experience-based approaches, which aim to infer a simple global function describing failure, (2) analytical model-based approaches, which aim to derive a set of mathematical equations to represent the state of a system, and (3) data-driven approaches, which employ machine learning and pattern recognition techniques to evaluate the RUL based on data measurements from the system. 
Although numerous studies have been conducted to develop such techniques, these are mostly limited, hindered by the lack of a simple global function to predict the RUL with an experience-based approach, and the lack of sufficient data to construct a reliable model or train a robust machine learning classifier via a model-based or data-driven approach, respectively. Furthermore, it is desirable that an approach designed for RUL prediction would be capable of analyzing ?online? measurements and adjust the RUL by the observed changes in the system, by considering temporal dynamics and analyzing the long-term dependencies within the data. Currently, Recurrent Neural Networks (RNNs), and particularly, gated RNNs such as Long Short Term Memory machines (LSTMs), demonstrate the state of the art performance in time series forecasting by holding a long-term memory of a time series data and identifying long term patterns that contribute to the classification task. The main issue impeding the utilization of LSTMs to construct a robust predictor of the RUL is the requirement for a large dataset to train the network, which would allow it to learn the complex dependencies 1 between the measurements and the full range of the RUL at each given time point, whereas sufficiently large training datasets are currently unavailable. In this work, a framework was developed that overcame these limitations and facilitates LSTM-based RUL prediction by simplifying the classification task in a way that would eliminate the need for large training datasets and allow training on the smaller datasets that are currently available. The general classification problem of RUL prediction was converted, to a set of simpler classification tasks, of predicting whether the RUL is larger than a given threshold, for a set of considered thresholds. This demonstrated that each of these simplified tasks can be successfully addressed by training LSTMs, even with the current, relatively small, available datasets; a high accuracy for the LSTMs trained to predict whether the RUL is greater than a given threshold, at every time point. Building on these results, I developed an ensemble technique that incorporates different threshold-based classifiers, for which I observe such high prediction accuracy, and utilize these to accurately infer the precise RUL at every time point. This method can be applied to any given run to failure dataset when the failure mechanism and system/component are constant across the dataset. In this work, I will start with an explanation of the different categories of prognostics methods and a literature review of existing strategies for RUL prediction. Next, I will describe the approach that I developed and introduce deep learning and particularly LSTMs, to provide the intuition and motivation for utilizing LSTMs in this work. This work contained two projects, first, a dataset with temporal measurements and RUL of power MOSFETs, second, constant measurements and RUL of filtration systems. Before each project, I will provide a brief introduction, for the first project I will provide an introduction 2 to MOSFETs and power MOSFETs, and the first dataset origin, then for the second dataset, I will provide the data origin and the experimental rig breakdown. Maintenance To prevent the wear of the device, the manufacturer provides a maintenance schedule, based on the statistics of the reliability tests done by the manufacturer. 
These guides are sufficient in most cases, but since the schedule is based on statistics, it will not be adequate when these components are used in mission-critical systems. There are two main issues with following the manufacturer's schedule: parts may be replaced too soon, which costs more, or too late, due to an unforeseen incident as mentioned before, which can result in a system failure. To address these issues, we need to consider monitoring-based prognostic health management methods.

Prognostics

Prognostics is an emerging field in mechanical engineering, applied in many areas, that aims to predict the remaining useful life (RUL) of a system to minimize its maintenance and downtime. Prognostics is a science aiming to accurately detect early signs of degradation, as well as to analyze failure modes and fault conditions. The monitoring of the component condition is done in-situ; the data is collected from each component and subsequently analyzed using one or more of the following three categories of methods.

Experience based

Experience based – inferred via simple reliability functions, such as using the Weibull law to analyze the time to failure [1-3] or utilizing the Paris law to capture the rate of defect propagation [4-6].

Paris law

The general equation of the Paris law is as follows:

da/dN = C (ΔK)^m

Where:
a – crack length
N – load cycles
C and m – material coefficients
ΔK – stress intensity factor range

This equation relates the crack growth per load cycle to the stress intensity factor range and the two material coefficients C and m. This function is widely used to correlate crack growth in many fields.

Figure 1 Wang et al. process

For example, in their work Wang et al. [7] used the Paris law to determine the RUL of an aircraft structure. They used Finite Element Analysis (FEA) to determine the stress range for each structural element, and used previous fleet data and degradation observations to determine the coefficients (C and m). The entire process is shown in Fig. 1.

Analytical model based

Analytical model based – a set of differential or algebraic equations derived to represent the behavior and degradation of a system. This approach is based on the physics models that represent the behavior of the components and the system. It has been tested in different fields; for example, Yu et al. [8] developed a new stress-based model of fatigue for bearings, Matej et al. [9] used stochastic dynamical models to perform prognostics on gear health, and Celaya et al. [10] developed a model-based methodology for predicting the remaining useful life of electrolytic capacitors. Analytical model based prognostics is broadly used in many fields. This method is especially preferred when dealing with a complex system for which not enough data has been collected to date to reliably enable the application of a data driven approach.

Extended Kalman Filter (EKF)

The general form of the EKF is as follows:

x_k = f(x_{k-1}, u_k) + w_k
z_k = h(x_k) + v_k

where both f and h are nonlinear functions, w_k and v_k are the process and observation noises (both assumed to have zero mean), u_k is the control vector, x_k is the predicted state, and z_k is the predicted measurement.

Particle Filter (PF)

The main idea behind the PF is the construction of the Probability Density Function (PDF) that will represent the state based on the available information.
We create the PDF using a set of points (particles) that represent a sample of values from an unknown state, together with a related set of weights that represent the discrete probability masses. The set of points is recursively updated from the nonlinear process model; this process can be summarized as a recursive Bayesian filter implemented with Monte Carlo (MC) simulation.

Data driven

Data driven approaches convert sensor measurements (such as thermal, vibration, acoustic emission, etc.) to quantitative information and subsequently utilize learning tools to estimate the RUL (such as Wavelet Packet Decomposition, WPD [11-13]). These learning tools include, but are not limited to, artificial neural networks [14-18], Hidden Markov models [19-21], and Support Vector Machines [22-25], as well as ensemble approaches that integrate multiple learning models for combined prediction [26]. The state of the art in the artificial intelligence field is deep learning, a family of machine learning approaches that utilize artificial neural networks to predict the RUL of a system.

Hidden Markov Models (HMMs)

HMMs are stochastic models that enable modeling a Markovian process with hidden states. HMMs are based on augmenting a Markov chain, which is a model that learns and provides information on a sequence of random variables, or states (of the model), that have a finite set of values that could be assigned to them. The assumption on which a Markov chain is established is that only the current state is important and useful for the prediction of the next state, whereas the states before it do not affect the prediction of the next state, other than by affecting the calculation of the current state. Hence, an HMM can be thought of as having "short memory", where the past is not used for predicting the future, only the present. More formally, in a sequence of state variables X_1, ..., X_(n+1), the Markov chain and HMMs only consider time point n when evaluating time point n+1:

Pr(X_(n+1) = x | X_0 = x_0, X_1 = x_1, ..., X_n = x_n) = Pr(X_(n+1) = x | X_n = x_n)

Also, HMMs require the pre-evaluation of the transition probabilities. Overall, HMM-based time series prediction relies on a short-term memory assumption, which may not be valid for RUL prediction, and requires a separate step of comprehensive training.

Support Vector Machines (SVMs)

An SVM is a supervised, discriminative machine-learning algorithm that may be employed for either classification or regression tasks. SVMs are defined by finding a separating hyperplane, which categorizes the training examples into the two defined classes (labels). The basic training algorithm of the SVM is meant to identify the separating hyperplane that maximizes the margin, which is the distance between the hyperplane and the data points located on each side of it (which would normally belong to different classes). The training points that are used to define the hyperplane (and are hence located on the margin) are termed the support vectors (SVs). After training the classifier, new examples are classified based on their location relative to the defined hyperplane. If the two classes of training examples are not linearly separable, it is possible to train SVMs with the following: (1) SVMs with a soft margin, which use a hinge-loss function to allow some points to be located on the "wrong" side of the margin while minimizing the number of such points. However, using SVMs with soft margins would normally still require an "almost linear"
separation of the training examples. Alternatively (2) using a Kernel function can fit a maximum margin hyperplane to a transformed feature space, where the transformation applied to the feature space may be non-linear and high dimensional. However, this requires knowing the transformation that would make the data separable (or the dimension/function that would separate the data) which is highly challenging. Hence, SVMs are not ideally suitable for time series prediction, especially when there may be non-linear relations between variables that are relevant for prediction. Literature review After reviewing the different approaches for prognostic and RUL estimation, I decided to focus on a fused data and physics-driven method that showed promise in improving the estimates RUL in complex systems. Recently artificial neural network has shown tremendous success in solving complex problems such as; computer vision[27, 28], speech recognition[29-31], and medical diagnosis[32-34]. After reviewing the different capabilities of ANN we looked for an ANN architecture that will fit our problem, since we have a time series prediction we looked into different methods that solve this type of problem. The most used architecture of ANN for this type of problem is the recurrent neural network[35-38]. However, this method is not optimal due to the explosion or vanishing gradient (which will be discussed in the RNN chapter). To overcome this issue we chose the long short term memory (LSTM) architecture, this structure 8 allows us to maintain long term memory and showed great potential in similar problems as ours in time series prediction in different fields. Pankaj et al.[39] showed the capability of RNN LSTM to detect anomalies in data. They applied the network on four datasets: (viz. ECG, space shuttle, power consumption, and engine sensors data), where they showed high precision (over 93%) in all of them. The results were also compared to standard RNN and showed that LSTM is more accurate overall, for example for the space shuttle dataset LSTM scored 0.93 compared to 0.89 with RNN. Xiaolei et al.[40] compared different methods to predict traffic speed. In their work they applied the following methods; Auto Regressive Integrated Moving Average (ARIMA), Kalman filter, Support Vector Machine (SVM), Elman NN, Time Delay Neural Network (TDNN), nonlinear autoregressive with exogenous inputs (NARX), and LSTM. They showed by comparison that LSTM is the best fit for that type of problem (time series prediction). Thomas et al.[41] deployed the LSTM network to predict stock directional movements in price, by doing so they showed that LSTM can perform on large datasets, they used S&P 500 from 1992 to 2015, also they showed that LSTM outperforms random forest, deep net, and simple logistics regression. After reviewing this literature I could carefully consider the advantages and disadvantages of the approaches for time-series prediction and I could make an informed decision that LSTM would be the most fitting choice for predicting RUL. All of these different approaches facilitate the prediction of the RUL and were widely utilized in several different fields. Because our first dataset is based on the accelerated aging run to failure MOSFET data [42] that is analyzed and learned with a neural network, this section will be focusing on RUL estimation with the following literature: ? Applying a recurrent neural network to predict the RUL of bearings [43]. 9 ? 
Recent (2019) work done on similar MOSFET run to failure data using different neural network methods [44]. ? A similar neural network method was applied to Lithium-Ion batteries [45]. ? The paper that published the dataset we used and compared the performance of three algorithms, two model based methods; extended Kalman filter and particle filter, and one data driven method based on the Gaussian process regression framework [42]. ? A dual-LSTM framework combining change point detection and remaining useful life prediction [46]. ? Remaining Useful Life Prediction for Rolling Bearings Using EMD-RISI-LSTM [47]. Liang et al. bearings RUL prediction using LSTM RNN Liang et al. used Long Short Term Memory (LSTM) networks, a type of Recurrent Neural Network (RNN), to predict the RUL of bearings. They started by acquiring run to failure data of eight bearings, then they extracted eight relevant features from that data. Next, they added six related similarity features to overcome the large variance of classical statistical features, and they used feature selection and fusion to find the most relevant features , which yielded eight final features to be used. These features were normalized by using the distribution of these features when measured during initial inspections. They used the normalized features from the data obtained from eight bearings to train an LSTM RNN network. After training the network, it was applied to predict the failure time and the RUL of a test set that was left out and unseen during training. They compared their results with a self-organizing map (SOM) and the results are shown in the following table: 10 Testing Current time True RUL (s) Predicted RUL RNN error SOM error dataset (s) (s) (%) (%) Bearing1_3 18,010 5,730 3,520 43.28 -31.76 Bearing1_4 11.380 2,900 1,100 67.55 62.76 Bearing1_5 23,010 1,610 1,980 -22.98 -136.03 Bearing1_6 23,010 1,460 1,150 21.23 -32.88 Bearing1_7 15,010 7,570 6,220 17.83 -11.09 Bearing2_3 12,010 7,530 4,680 37.84 44.22 Bearing2_4 6,110 1,390 1,660 -19.42 -55.40 Bearing2_5 20,010 3,090 1,410 54.37 68.61 Bearing2_6 5,710 1,290 1,470 -13.95 -51.94 Bearing2_7 1,710 580 900 -55.17 -68.97 Bearing3_3 3,510 820 790 3.66 -21.96 Mean of error 32.48 53.24 Table 1 RNN and SOM RUL predictions We can see from the table that LSTM RNN predictions were better than SOM. After analyzing the test dataset Liang et al. Used the trained LSTM RNN and SOM on a new industrial available dataset of generator bearings of wind turbines. The results of that test are shown in the following table: 11 Testing Current time True RUL Predicted RUL RNN error SOM error dataset (h) (h) (h) (%) (%) FJ1 300 183 108 40.9 -210.35 FJ2 120 58 18 69 28.96 FJ3 135 40 38 5 -117.5 FJ4 130 48 46 4.1 -8.33 FJ5 320 166 124 0.25 -14.16 FJ6 330 109 131 -20.2 22.94 Mean of 23.24 67.09 error Table 2 RNN and SOM RUL prediction on a new dataset Again, we see that LSTM RNN predicted more accurately the RUL compared to SOM. In conclusion, we can learn that LSTM RNN can be used to predict the RUL. Since this paper did not share the actual procedure used to obtain these results, it is hard to understand why their results are not as accurate as we believe can be achieved using this method. The only changes we can offer to improve their results will be to first minimize the output range and normalize it, as LSTM RNN can achieve much higher accuracy when the output range is smaller, and secondly, we will suggest increasing the dataset size by randomly sample it. 12 Karkulali et al. 
MOSFET prognostics using feed-forward neural network

In their work, Karkulali et al. used a feed-forward neural network with three inputs, one output, and one hidden layer containing from one to forty hidden neurons. Their data contained only two modules with a limited number of data points; they separated the data into training and test datasets and were able to achieve almost 85% accuracy on these modules. The accuracy was measured using the following equation:

RA = 1 - |true RUL - predicted RUL| / true RUL

where RA is the relative accuracy. They summarized their sensitivity test over different numbers of hidden neurons (N_h), the computational time, and the relative accuracy, using the equation mentioned, in the following table:

N_h   Computational time (s)   Relative accuracy
1     104.89   0.8413
5     107.32   0.8493
10    109.88   0.8049
20    113.73   0.7865
30    135.24   0.7909
40    150.72   0.4939

Table 3 computational time vs accuracy

When they applied their network to a new dataset (same module type) that had different behavior, the network could not estimate the RUL correctly. Their main issue was the lack of data; a small dataset such as the one used in that work is not adequate to allow the network to train properly. Another issue is the use of a feed-forward network, in which the data is fed to the network as independent steps and there is no memory in the network; this can cause the training of the system to create dependencies between steps and lead to overfitting. Given the drawbacks mentioned above, the results of Karkulali et al. are less relevant to this work but were mentioned because they addressed MOSFET RUL estimation using a neural network.

Estimating RUL of Lithium-Ion batteries using RNN LSTM

Yongzhi et al. used long short term memory recurrent neural networks to predict the remaining useful life of lithium-ion batteries. They used two layers of LSTM: the first layer contained fifty hidden units, and the second layer contained one hundred hidden units. This is a large network, and its large size can cause an overfitting issue, as presented in their work; to overcome this issue Yongzhi et al. used the dropout method. This method, which was developed by Google, prevents complex co-adaptations when training the network and helps to reduce the risk of overfitting. By comparing their results to SVM and simple RNN, they showed the advantage of using LSTM RNN for predicting RUL. Their network attempted to estimate the RUL of four cells in two scenarios: the first after training the network on fifty percent of each cell's data, and the second after training on seventy percent of each cell's data, both without offline training data, meaning that each network was trained on a separate cell. Also, they estimated the RUL of cells one and three when using offline data, obtained from two additional cells that underwent the same conditions as cells one and three, for training the network. Their results are as follows. The results of cells 1-4 without offline data using RNN LSTM compared to SVM and simple RNN are shown in the table below; each method is reported at two starting cycles, corresponding to the two training scenarios.

Method     Cell 1 (starting cycle, error)   Cell 2 (starting cycle, error)   Cell 3 (starting cycle, error)   Cell 4 (starting cycle, error)
LSTM RNN   253, -3    285, 48    289, 35    278, 58
LSTM RNN   354, 15    399, 26    404, 19    389, 14
SVM        253, -21   285, -34   289, 40    278, -47
SVM        354, 30    399, 42    404, 23    389, 15
RNN        253, 135   285, 195   289, 181   278, -
RNN        354, 78    399, 95    404, 72    389, 88

Table 4 cycles error with different methods

It shows that in most cases LSTM RNN is superior to the other methods and can predict the RUL of the cells more precisely.
As mention before Yongzhi et al. also used offline data to create a network and then used a smaller portion of the data (120 cycles) to predict the RUL of cells one and three, the results were compared to particle filter (PF) method and simple RNN that used the same portion of the data. The results of that comparison are shown in the following table. Cell 1 Error Cell 3 Error LSTM RNN -19 40 PF 52 58 RNN 110 93 Table 5 cycles error with different methods 15 Again, it shows the advantage of using LSTM RNN over other methods. In summary, this work showed the potential of LSTM RNN in predicting RUL, although they may have used the too large network to analyze the data, as the data does not seem large enough, and had to overcome this issue by applying the dropout method they managed to show the advantage of LSTM RNN in predicting time series data. Power MOSFET RUL prediction comparison between two model based and one data driven methods Celaya et al. were the same group from NASA that collected the data that we are using in our work. In their work they made the following assumptions: ? They used ????(??) as their single health indicator feature. ? The die-attach failure mechanism is the only degradation that happens during accelerated testing. ? ????(??) is responsible for the degradation from nominal through failure condition. ? 0.05 increase in ????(??) is set as a failure threshold. ? Each prognostic tool, out of the three that were used, will predict the RUL from a set time point, hence the future load is estimated to remain similar. In their work, they used the same accelerated aging data of MOSFETs and it was analyzed using the following three methods: Gaussian Process Regression (GPR): this data driven approach is used to estimate degradation state using the training measurement data. To predict the state of degradation we will first 16 assume a prior distribution using prior knowledge [48], then we will modify the distribution to fit our measurements with a probabilistic function for regression over the training data [49]. The result of this process will be a mean function that describes the behavior of the data and additional functions that describe the uncertainty. EKF and PF were explained in the previous chapter. After applying these three methods, they summarized the results in the following table: ?? RUL GPR EKF PF 140 88 N/A 64.95 (23.02) 77.65 (10.35) 150 78 N/A 80.22 (-2.22) 65.85 (12.15) 160 68 N/A 56.64 (11.36) 58.33 (9.67) 170 58 N/A 50.15 (7.85) 49.47 (8.53) 180 48 73.2 (-25.2) 42.75 (5.25) 38.68 (9.32) 190 38 33.4 (4.6) 30.35 (7.65) 27.14 (10.86) 195 33 17.6 (15.4) 18.57 (14.43) 24.76 (8.24) 200 28 14.6 (13.4) 17.24 (10.76) 21.09 (6.91) 205 23 13.8 (9.2) 18.28 (4.72) 16.66 (6.34) 210 18 11.8 (6.2) 13.46 (4.54) 14.68 (3.32) Table 6 RUL prediction with error in parentheses This work presents the results that were acquired using the same data as used in our work, all of the techniques were underestimating the RUL and can result in an earlier replacement of parts to 17 avoid failure. The main setback of this work is the assumption that the load will continue to be the same. This assumption could not be applicable when estimating a real-time working module. A dual-LSTM framework combining change point detection and remaining useful life prediction Shi et al. used two-stage prediction, first they normalize the data, smooth it, and selected the sensors, and then they added the RUL labels. 
They trained two classifiers, the first classifier was trained to recognize the start of degradation, when the first classifier indicates it the second classifier is used to determine the RUL. The full flowchart of the process can be seen in the following figure: Figure 2 Flowchart of the proposed Dual-LSTM framework [46]. This method was used on two publicly available turbofan engine degradation datasets. They compare the results of their method with Vanilla LSTM, RNN with fixed change point, and 18 LSTM with fixed change point. The comparison between the methods can be seen in the following figure: Figure 3 RMSE comparison [46] The results show that in most cases LSTM is more accurate than RNN for predicting RUL and that adding stages for predicting the RUL is increasing the prediction accuracy. Remaining Useful Life Prediction for Rolling Bearings Using EMD-RISI-LSTM In their work, Guo et al. applied empirical mode decomposition (EMD) and long short-term memory (LSTM) network, to improve accuracy and robustness under different working conditions. The architecture integrates three parts. First, the failure vibration signal decomposed 19 into several intrinsic mode functions (IMFs), a residual through EMD decomposition, in parallel, to select the IMFs with more degradation features, a new method named RISI (representative IMF selection index), which is based on cosine similarity, and Euclidean distance. In the next step, the LSTM model is trained for each IMF and residuals. The final step will be using the selected IMF prediction and utilize it to determine the RUL prediction. The entire process can be seen in the following figure: Figure 4 Guo et al. proposed framework [47] To evaluate their proposed framework they compared their method with EMD-LSTM, LSTM, and BPNN (back propagation neural network). The results can be seen in the following table: Current time True RUL Predicted Guo et al. EMD- LSTM BPNN (s) (s) RUL (s) (%) LSTM (%) (%) (%) Bearing 1-3 18010 5730 4740 17.28 24.35 43.28 -31.76 Bearing 1-4 11380 2900 1730 40.34 60.43 67.55 62.76 20 Bearing 1-5 23010 1610 2050 -27.33 -24.54 -29.76 -136.03 Bearing 1-6 23010 1460 1960 -34.25 -24.65 35.76 -32.88 Bearing 1-7 15020 7570 7180 5.15 13.98 17.28 -11.09 Bearing 2-3 12000 7530 8410 -11.69 -16.87 -19.43 44.22 Bearing 2-4 6100 1390 1830 -31.65 -22.74 54.23 -55.40 Bearing 2-5 20010 3090 3370 -9.06 20.34 27.28 68.61 Bearing 2-6 5720 1290 1470 -13.95 -15.98 -16.97 -59.49 Bearing 2-7 1710 580 290 50 -52.57 -55.17 -68.98 Mean value 22.10 29.87 32.48 40.65 Score 0.31 0.29 0.26 0.06 Table 7 Evaluation and analysis of prediction results of four models The results show that for RUL predictions, LSTM is more accurate than BPNN and that by improving the input of the LSTM we can further improve its accuracy. Our approach After reviewing all the relevant literature, I reasoned that deep learning approaches are preferred for estimation RUL, and thus chose a deep learning approach fused with a physics- based approach that was applied to the data using my knowledge of different failure mechanisms. I searched for a dataset that is large enough to allow me to properly train the network. After reviewing the different neural network structures and their applications, I decided that LSTM RNN would be the optimal fit to address the challenge of RUL prediction. 
I found two datasets: the first, from the NASA website, consists of accelerated aging data of MOSFETs and includes 42 modules; the second is a filtration system run-to-failure dataset provided by PHME2020, the fifth European conference of the Prognostics and Health Management Society 2020, as part of the PHME2020 data challenge. At first, we tried applying an RNN LSTM to the first dataset, but we were unable to get the training to converge and achieved very poor results. After reviewing the lifespans of the modules, we realized we needed to find a better way to train the network, as we did not have enough data to train it. We realized that with the amount of data we had, we could not train a network that can answer such a complex question. Therefore, we decided to simplify the question and ask the classifier to answer a simpler one: instead of asking what the RUL is at each time point, we asked whether a certain threshold had been passed at that point. By changing the question to a simpler one, we were able to use a much smaller dataset and achieve a more accurate result. After changing the question we saw a dramatic increase in accuracy, but we still had the issue of the wide lifespan range. We identified that issue because we had high accuracy only for modules with similar lifespans, when the selected thresholds fell within their lifespans. To overcome that issue, we changed the thresholds to quantiles; that change allowed us to accurately predict the RUL regardless of the lifespan range. To prevent overfitting of the training data, we used a relatively small network with a single LSTM cell and five hidden units, which has a limited number of parameters and hence is less prone to memorization and overfitting. This small architecture further allowed a robust prediction without any filtering techniques, such as dropout, which were necessary for previous studies. To further enlarge the dataset that we had, we randomly segmented pieces of the data, which facilitated a fivefold increase in the sample size. After training the network, we were able to achieve high accuracy on the test set, which was constructed out of never-seen modules that were entirely left out during training. To describe the approach that enabled this level of accuracy, the next chapter will provide the background and basics of neural networks and backpropagation, with a focus on recurrent neural networks and the long short-term memory algorithm. This review provides the background and reasoning for the approach designed here, and explains why it is well suited to the RUL prediction problem.

Deep learning

Deep learning enables a non-linear separation using multiple levels of abstraction of the data via multiple processing layers of computational models. Part of the learning process is the representation of the data, which should be optimized to boost the ability of the network to process and learn the data. The representation of the data is fed into different layers of network architectures, each of which, in turn, applies non-linear functions and learns how to weight different parts of the dataset to optimize the prediction objective. The commonly used objective in neural networks is the minimization of a loss function. These methods have allowed the artificial intelligence community to advance in solving complex problems that were unsolvable for many years and not amenable to standard techniques. Deep learning tools have a unique capability in uncovering and learning complex structures in high-dimensional data.
Naturally, with the recent increase of high computational power, they became broadly used in various fields, including business, government, and science. These approaches are currently considered the state of the art for speech recognition [50-52], image recognition [53-56], and outperform many machine-learning algorithms in predicting reliability [43, 45], clinical applications [57], etc. Deep learning is a wider family of machine learning methods that are based on neural network architecture. 23 Artificial neural network (ANN) This architecture was inspired by biological systems, although there is no real resemblance between artificial neural networks (ANN) and the neurons observed in biological systems. The purpose of a neural network is to alter input data into the desired output. The structure of the network is based on weights and non-linear functions, the data is inserted into the network through the input layer, and each node is multiplied by the weight of that node and applied a Figure 5 fully connected network non-linear function the results is then fed to the next layer along with the results of all the other nodes. This procedure continues in each layer, input and hidden layers until reaching the output layer as shown in Fig.5. This type of network is called the vanilla network and each step can be represented by the following function: ?? = F(?? ? ??) Where: ?? is the input to the i node. 24 ?? is the weight of the i node. ?? is the output of the i node. And F can be either one of the following non- linear functions: tanh, sigmoid, and RELU (Fig.6). Each of these functions is usually used for a different purpose, for example; for recurrent neural networks, we will use tanh, for image recognition RELU, etc. Figure 6 Functions for ANN To use the network we must first train it. The training is done by adding labels to the data that define the required output from the system. For example, if we want the network to distinguish between a dog and a cat, we will have to label the dog images as dogs and the cat?s images as cats. Then we will feed the images to the network, after each iteration the network will know if it was able to distinguish between the images and if the network was wrong it would modify the weights to adjust the output. After we feed the network with enough images and let it run the required number of iterations, the network should be able to decide when seeing new images (not the ones used in training) to distinguish between a cat and a dog image. Although this procedure seems simple, we need to understand that weights change is done to each node and each node will be usually fully connected to the next step (fully connected network) as seen in Fig.5. In each step, we will apply the nonlinear function and after each iteration, we will modify the weights from the first layer, this process requires heavy computational abilities and is the main reason why it was neglected for a long 25 time. To overcome that issue a more simple calculation approach was needed, it was only in the mid-1980s where the backpropagation was introduced. Backpropagation Ever since pattern recognition started, researchers aimed to replace their hand calculations with a trained multi-layer network. It was only until the mid-1980s that the solution was developed. Several different groups during the 1970s and 1980s [58-60] discovered the idea that this could be done, and that it worked, independently. The idea was to train multilayer architectures by simple stochastic gradient descent. 
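To make this concrete, here is a minimal sketch, not taken from this work, of one stochastic gradient descent update for a single sigmoid unit; the chain rule discussed next is what extends this gradient computation through many layers. All names and values are illustrative assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 7))         # one sample with 7 input features
y = np.array([[1.0]])               # its label
W = rng.normal(size=(7, 1)) * 0.1   # weights to be learned
lr = 0.1                            # learning rate

for step in range(100):
    z = x @ W
    y_hat = sigmoid(z)                      # forward pass
    loss = 0.5 * np.sum((y_hat - y) ** 2)   # squared-error loss, tracked for monitoring
    # Backward pass: chain rule, dL/dW = dL/dy_hat * dy_hat/dz * dz/dW
    grad_z = (y_hat - y) * y_hat * (1 - y_hat)
    grad_W = x.T @ grad_z
    W -= lr * grad_W                        # stochastic gradient descent update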
The procedure of backpropagation that is used to compute the gradient with respect to the weights can be seen as the chain rule for derivatives. This approach dramatically reduces the computing resources needed to train the network, as it computes the gradient change backward with respect to the output; once the gradients are computed, computing the weight updates of each module is straightforward. Even though this method reduces the computational load, it was only around 2010 that the computational cost dropped to a point where ANNs were seriously considered again.

Recurrent neural network (RNN)

The architecture that gained the most from backpropagation was the RNN. This method is optimal for speech and language, as they involve sequential inputs. An RNN processes the input elements in order and integrates the output of each element into the input of the next one, which requires even more gradient and weight computation, hence the advantage of backpropagation. The distinctive structure of the RNN, and the way it is trained, allows it to fit perfectly when needed to predict the next word in a sentence [61], the next character in a word [62], or even more complex tasks such as predicting the remaining useful life of an ion battery [45], of bearings [43], and more. An RNN can be seen as a deep feedforward network, once unfolded, where the layers share the same weights. Its structure can be seen in the following equation:

o_t = F(W · (o_{t-1}, x_t))

Where:
x_t is the input of that time step.
W is the weight matrix.
o_t is the output of that time step.
o_{t-1} is the output of the previous time step.
F is the same non-linear function as mentioned for the ANN.

Although RNNs are intended to learn long-term dependencies, in practice they struggle to do so. One of the reasons for the lack of long-term memory can be found in the backpropagation method. Even though backpropagation is beneficial to the RNN, it has a major flaw: due to the large number of time steps, the gradient can grow or shrink drastically and thus explode or vanish. This issue led to the development of the Long Short Term Memory (LSTM) method.

Long Short-Term Memory (LSTM) machines

Long Short-Term Memory (LSTM) networks are a type of Recurrent Neural Network that uses a time series as input for various prediction tasks, and is composed of the following components:

f_t = σ_g(W_f x_t + U_f h_{t-1} + b_f)
i_t = σ_g(W_i x_t + U_i h_{t-1} + b_i)
o_t = σ_g(W_o x_t + U_o h_{t-1} + b_o)
c_t = f_t ∘ c_{t-1} + i_t ∘ σ_c(W_c x_t + U_c h_{t-1} + b_c)
h_t = o_t ∘ σ_h(c_t)

Where the initial values are c_0 = 0 and h_0 = 0, and ∘ denotes the Hadamard product. x_t is the input vector to the LSTM unit (the sensor measurements). f_t, i_t, and o_t are the activation vectors for the forget gate, input gate, and output gate, respectively. h_t is the output vector of the LSTM unit, and c_t is the cell state vector. W and U are the weight matrices and b are the bias vectors that are learned during training. σ are the non-linear functions, where σ_g is a sigmoid function and σ_c and σ_h are the tanh function [63]. Although the LSTM requires additional cells (extra gates), it enables a solution for the vanishing and exploding gradient by allowing the cell to forget and to change the inner weights of different cells. Also, the structure allows it to overcome long time lags, discard noise, and much more. Unlike Markov models, the LSTM can handle an unlimited number of states. The LSTM is also able to distinguish between two or more widely separated occurrences. LSTM networks have shown remarkable results in many fields compared to traditional RNNs when having several layers for each time step.
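As a concrete illustration of the equations above, the following minimal NumPy sketch performs a single LSTM cell step. It is illustrative only (it is not the dissertation's code, and all names are assumptions); the parameters of the four gates are stacked into one weight matrix for brevity.

import numpy as np

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b hold the f, i, o, and candidate parameters stacked row-wise."""
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b        # all four pre-activations at once
    f_t = sigmoid(z[0:H])               # forget gate
    i_t = sigmoid(z[H:2*H])             # input gate
    o_t = sigmoid(z[2*H:3*H])           # output gate
    g_t = np.tanh(z[3*H:4*H])           # candidate cell update
    c_t = f_t * c_prev + i_t * g_t      # new cell state (Hadamard products)
    h_t = o_t * np.tanh(c_t)            # new hidden/output vector
    return h_t, c_t

# Toy usage with 7 input features and 5 hidden units:
rng = np.random.default_rng(0)
D, H = 7, 5
W = rng.normal(size=(4*H, D)) * 0.1
U = rng.normal(size=(4*H, H)) * 0.1
b = np.zeros(4*H)
h, c = np.zeros(H), np.zeros(H)         # h_0 = 0, c_0 = 0
for x_t in rng.normal(size=(10, D)):    # a short example sequence of 10 time steps
    h, c = lstm_step(x_t, h, c, W, U, b)

Carrying h and c forward across the sequence is what gives the network its long-term memory, in contrast to the memoryless feed-forward pass described earlier.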
It is widely used in encoders and decoders, such as Google Translate and many other machine translation systems. The structure of the LSTM cell is as follows:

Figure 7 LSTM cell inner structure
Figure 8 LSTM network

The function of the LSTM shown above can be summarized as:

(i, f, o, g) = (σ, σ, σ, tanh) ( W · (h_{t-1}, x_t) )

where the sigmoid and tanh non-linearities are applied to the corresponding blocks of the product. Each letter represents a gate: i is the input gate, and its value indicates how much we write to the cell; f is the forget gate, which indicates how much we delete from the cell; o is the output gate, which indicates how much of the cell to reveal; and finally g is called the gate gate, and it indicates how much to write to the cell.

Project one: Quantile-based LSTM Remaining Useful Life prediction of MOSFETs.

Background

Power modules

The heart of a power electronic system is the power semiconductor switching module. Power modules can be based on many different semiconductor switching technologies:
- Silicon Metal Oxide Semiconductor Field Effect Transistors (MOSFETs) (1980s).
- Silicon Insulated Gate Bipolar Transistors (IGBTs) and MOS Controlled Thyristors (MCTs) (1990-2000s).
- Silicon Carbide MOSFETs and GaN High Electron Mobility Transistors (HEMTs) for higher frequency applications.

Our work will focus on power MOSFET modules.

Power MOSFET

General characteristics

Standard n-channel MOSFETs have four terminals, source (S), gate (G), drain (D), and body (B), as shown in Fig. 9. The terminals are usually reduced to three, since the body (B) terminal is typically connected to the source (S) terminal, to prevent it from floating freely, which can limit transistor control. The gate voltage controls the depth of the channel through a capacitor formed by the gate and the substrate, separated by a thin layer (yellow line in Fig. 9) of SiO2 (grown or deposited) or a rare earth metal oxide (e.g., HfO2). The potential of the gate is relative to the source. The electrons responsible for the main power current flow in through the source terminal and exit through the drain terminal.

Figure 9 MOSFET structure

MOSFET lifespan under standard conditions is too long for lab testing, since the MOSFET fabrication process has improved over the years; hence an accelerated testing approach is typically used. When using accelerated testing, as with the data used for this work, it is important to make sure that the failure mechanism in the accelerated test remains the same as under standard conditions.

Power MOSFET failure mechanisms

The power MOSFET structure is different from a normal MOSFET structure since it handles high power levels. Power MOSFETs are structured, in most cases, vertically, as seen in Fig. 10, compared to the normal MOSFET planar structure seen in Fig. 9. This unique structure of the power MOSFET gives rise to several failure modes, for example:

Figure 10 Power MOSFET structure

Gate oxide breakdown: to allow an increase in switching speeds, manufacturers had to decrease the gate dielectric thickness. This reduction made the gate oxide region vulnerable to damage from high gate voltage. Hence, when operating a power MOSFET with gate voltage values that are above the device limitation, the power MOSFET will be subjected to a reduced lifetime, and it may even fail immediately.

Maximum drain to source voltage: power MOSFET specifications indicate the maximum drain to source voltage; exceeding that voltage can cause breakdown and may damage other circuits in the device due to higher power dissipation.
These limitations may be breached by many causes that are not design-related or even usage-related, such as electrostatic discharge, which may push the device beyond the gate voltage limit and can cause instant failure [64]. In this work, a MOSFET dataset was used in which the major failure mechanism was die-attach degradation, which is typical for discrete devices with lead-free solder die-attachment. The data was provided by the work of Celaya et al. [42], who determined in their experiment that die-attach degradation was the failure mechanism, resulting in an increase in ON-state resistance due to its dependence on junction temperature. Increasing resistance can therefore be used as a precursor of failure for the die-attach failure mechanism under thermal stress, so Rds(ON) was measured, as the resistance increases while the die-attach degrades under high thermal stresses.

Dataset one

The dataset used throughout this project is from accelerated aging of the MOSFET IRF520Npbf (TO-220 package), where the failure mechanism is die-attach degradation due to thermal and power cycling stress conditions, which is typical for lead-free solder die attachment [23]. This failure is associated with a mismatch between the coefficients of thermal expansion (CTE) in the component structure, which induces thermo-mechanical stresses in the component caused by the thermal cycling overstress. The dataset, which describes thermal overstress applied to the devices to achieve accelerated aging, was downloaded from the NASA website [24]. The failure conditions were thermal runaway, latch-up, and failure to turn ON due to loss of gate control. The thermal cycles were induced by not using an external heat sink, thus substantially reducing the heat dissipation capability. The thermal cycling was controlled using a thermal sensor on the device case. The power cycling used a gate voltage that was a square-wave signal with an amplitude of 15 V, a frequency of 1 kHz, and a 40% duty cycle. The drain-source was biased at 4 V dc, and a resistive load of 0.2 Ω was used on the collector-side output of the device [22]. Temperature control was employed within the low and high ranges. The aging process is described in detail in [25], and the accelerated aging methodology is presented in detail in [26]. The complete accelerated aging system diagram is shown in Fig. 11.

Figure 11 Complete accelerated aging system diagram [42]

Procedure

Data representation

To enable the use of LSTMs to predict the RUL given run-to-failure data, we first represent the measurements that will be provided as input to the LSTM. The data includes seven features, summarized in Table 8. The transient features were collected consistently during the entire experiment, whereas the steady-state data was not, and thus contains time gaps. Therefore, the five steady-state features and the two transient features were synchronized by the steady-state data points: each transient data point was assigned a steady-state value by finding the steady-state sample closest in time to the transient data point. Next, the drain voltage was divided by the drain current (a square signal) to obtain Rds(ON). We used the 0.25 and 0.75 quantiles to produce the high and low values of each pulse and remove noisy outliers, and these were used as the transient input features.
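A minimal sketch of this transient preprocessing step is given below; the variable names (drainV, drainI, pulseId, transientTime, steadyTime) are illustrative placeholders and are not taken from the dissertation's scripts.

```matlab
% Sketch: derive per-pulse Rds(ON) bounds from the transient drain measurements
% and align each transient point with its nearest steady-state sample.
rdson = drainV ./ drainI;                        % instantaneous ON-resistance

pulses  = unique(pulseId);                       % one entry per gate pulse
rdsLow  = zeros(numel(pulses), 1);
rdsHigh = zeros(numel(pulses), 1);
for k = 1:numel(pulses)
    r          = rdson(pulseId == pulses(k));
    rdsLow(k)  = quantile(r, 0.25);              % low value of the pulse, trims outliers
    rdsHigh(k) = quantile(r, 0.75);              % high value of the pulse, trims outliers
end

% Nearest steady-state sample (in time) for every transient time point.
[~, nearestIdx] = min(abs(transientTime(:) - steadyTime(:).'), [], 2);
```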
Feature                 Type
Supply voltage          Steady state
Drain-source voltage    Steady state
Drain current           Steady state
Package temperature     Steady state
Flange temperature      Steady state
Drain current           Transient
Drain voltage           Transient

Table 8 The features included for RUL prediction.

The seven features for each module M_i yielded a feature table of size 7 x N_i for that module, where N_i is the number of time points measured for the module. The RUL for each time point in M_i was converted into constant intervals of timeline data. The gaps (stoppages of the experiment at the end of each day) were removed to maintain an "active time" measure (i.e., the actual time that the module has been operating). Therefore, each time point t in an M_i table was assigned a distinct RUL, denoting how much more "active time" module M_i has left until failure.

To allow RUL prediction at different time points in the life of a module, the converted run-to-failure data described above was segmented to produce multiple fractions of measurements for each module. The segmented data was used for training and testing, where each data point describes a certain time frame of measurements. Therefore, each feature table of module M_i was segmented into 7 x (N_i/K) tables, where N_i is the number of time points measured and K was set to 50 (through parameter optimization on the training dataset), resulting in ~N_i/50 segments for each module.

Due to large and inconsistent measurement intervals and measurement errors, for instance faulty readings from temperature sensors (over 1000 degrees), 17 modules were discarded and 25 were used for training and evaluation. The training dataset used to train the network included 16 modules. The performance was evaluated on the validation dataset, comprising 4 modules, for parameter tuning and algorithm design. Eventually, and only after the final approach had been established, it was tested on the test dataset of five modules that were left unseen during the entire training process.

Module number    Set assignment    Lifetime (minutes)
1                training          186.0678
2                training          146.6166
3                training          859.1858
4                training          704.5452
5                training          135.5832
6                training          284.4624
7                training          215.8803
8                training          231.8026
9                training          614.3334
10               training          308.783
11               training          88.28323
12               training          320.5675
13               training          437.0867
14               training          477.2924
15               training          299.7784
16               training          388.3094
17               validation        950.0717
18               validation        897.989
19               validation        821.034
20               validation        868.3347
21               test              1404.347
22               test              311.559
23               test              358.1089
24               test              395.9074
25               test              290.7152

Table 9 The MOSFET modules used throughout this study for training, validation, and testing, and the respective lifespans of these modules.

Training process

Because binary classification tasks can achieve higher performance when trained with smaller datasets [65], compared with more complex tasks such as directly predicting the RUL, we re-define the RUL prediction problem as a set of binary classification problems. This is crucial given the relatively small dataset, which does not allow training a predictor to directly predict the RUL (Fig. 12). We first trained a sequence-to-sequence LSTM to predict, from the input data features, the precise remaining useful life (RUL) at each segment of measurements (each segment having 100 data points). We used the training data to train the LSTM, and the validation data to evaluate the performance, via (1) the RMSE (root mean square error) and (2) the correlation between the predicted RUL and the actual RUL.
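These two validation metrics for the naive sequence-to-sequence LSTM can be computed as in the short sketch below; predRUL and trueRUL are assumed vectors of per-segment predictions and ground-truth RUL values, and the names are illustrative.

```matlab
% Sketch: evaluation metrics for the naive RUL regressor on the validation set.
rmse = sqrt(mean((predRUL - trueRUL).^2));   % (1) root mean square error
r    = corr(predRUL(:), trueRUL(:));         % (2) correlation of predicted vs. actual RUL
```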
We trained LSTMs with different hyperparameters, setting the number of hidden units to (5, 50, 100, 500) and the number of epochs to (10, 20, 50).

Figure 12 Naive LSTM

Therefore, we defined binary classification tasks that are amenable given the shortage of data, and used them in combination to predict the precise RUL. Four classifiers were trained to predict when a module has reached one of four stages of its lifetime: half-life, final third, final quarter, and final fifth. Each classifier predicts when a module has reached a specific quantile of its life, based on the segmented measurement input data. Quantiles were used as thresholds because of the large variability in the lifespan of the different modules, which ranges from 90 minutes to 1400 minutes (Table 9). All classifiers were trained using the Adam optimizer [66], fifty epochs, a mini-batch size of twenty-seven, and one LSTM layer with five hidden units. These parameters were set via hyperparameter optimization applied to the training data.

Therefore, the training step resulted in four trained LSTMs, predicting whether a module has reached the last half, third, quarter, or fifth of its life. Applying each LSTM to the segmented validation dataset produced classification scores between zero and one for each segment, classifying whether a given segment is derived from a time point in the last half, third, quarter, or fifth of the module's lifetime. Since these are binary classification tasks, we examined the receiver operating characteristic (ROC) curves and the precision-recall curves (Fig. 13) for the four validation modules, for each of the four LSTMs trained. We found that in all cases the quantile-based performances were robust, providing a solid foundation for using these scores to infer the precise RUL at different time points.

Figure 13 Classifier performance evaluation. Receiver operating characteristic (ROC) curves and precision-recall curves showing the performances of the four classifiers (LSTMs trained to predict the last half (0.5), third (0.33), quarter (0.25), and fifth (0.2) of life), when applied to the four validation modules (V1-V4, validation modules 1 through 4). Across modules and thresholds, AUPRC values ranged from 0.83 to 0.97 and AUC values from 0.87 to 0.98.
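Each of these quantile classifiers can be defined and trained with MATLAB's Deep Learning Toolbox using the hyperparameters stated above (Adam, 50 epochs, mini-batch size 27, one LSTM layer with five hidden units). The sketch below mirrors the trainLSTM1 pseudocode given later in the code overview, but it is a reconstruction for illustration rather than the dissertation's exact script.

```matlab
% Sketch: train one quantile-based binary LSTM classifier.
% XTrain - cell array of 7 x 100 feature segments (one cell per segment)
% YTrain - categorical labels: has this segment passed the quantile threshold?
inputSize      = 7;     % seven input features per time step
numHiddenUnits = 5;
numClasses     = 2;

layers = [ ...
    sequenceInputLayer(inputSize)
    lstmLayer(numHiddenUnits, 'OutputMode', 'last')
    fullyConnectedLayer(numClasses)
    softmaxLayer
    classificationLayer];

options = trainingOptions('adam', ...
    'ExecutionEnvironment', 'cpu', ...
    'MaxEpochs', 50, ...
    'MiniBatchSize', 27, ...
    'GradientThreshold', 1, ...
    'Verbose', 0, ...
    'Plots', 'training-progress');

net = trainNetwork(XTrain, YTrain, layers, options);

% Quantile scores for validation segments: the second column of "scores" is
% the probability that a segment lies past the quantile threshold.
[yPred, scores] = classify(net, XValidation);
```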
RUL prediction

The training process yielded four classifiers, where each classifier assigns a score to every segment of data when applied to the segmented input measurements. The score ranges from zero to one, where a higher score denotes a higher likelihood that the segment of data has passed a given threshold (last half, third, quarter, or fifth). To use these scores to infer the precise RUL at any given time point, we designed an ensemble technique that combines the scores from all four classifiers. It works in four steps. First, the LSTM that was trained to predict whether a module has reached the last half of its life is applied to each segment of data from a module (ordered from first to last). When the scores of this LSTM for a segment pass the threshold, we start predicting the RUL and assign that point the time that has already passed (because the LSTM predicted that half of the lifetime has passed). Second, after the module is predicted to be in the last half of its life, we use the scores produced by the LSTM that was trained to predict when a module has reached the last third of its life. At the time point when these scores pass the threshold, we assign that time point half of the time that has already passed. Similarly, after the first two LSTM thresholds have been reached, we use the scores produced by the third LSTM, denoting whether a module has reached the last quarter of its life, and when the scores pass this threshold, we assign that time point one-third of the time that has already passed. Finally, after all three LSTM thresholds have been reached, when the scores produced by the fourth LSTM, predicting when a module has reached the last fifth of its life, pass the threshold, we assign that time point one-fourth of the time that has already passed.

To handle outliers and noise in the prediction, all scores were smoothed over a window of five consecutive segments before being used for prediction. At every time point that does not correspond to a time point where a threshold was reached, the predicted RUL was defined as the RUL predicted at the previous time point minus the time that has elapsed since that time point. All four steps use the same threshold of 0.5, to avoid overfitting the training data with the ensemble technique. By definition, the RUL is a monotonically decreasing time series. Therefore, we do not allow the RUL at any time point t to be greater than that at the previous time point t - 1. To that end, the predicted RUL is always set to the minimum of the predictions from the four LSTMs. The complete framework is illustrated in Fig. 14.

Figure 14 Pipeline illustration. The module's measurements are given as input to the four LSTMs, each trained to predict when a module has reached a certain quantile in its life (last half, third, quarter, and fifth). When applied to time points of data from the validation and test sets, each of the four LSTMs yields a score between zero and one, denoting the likelihood that the module has reached a certain quantile in its lifetime. These scores are used to predict the precise RUL by casting the time that has passed until each quantile is reached (black circles). At intermediate time points (dashed lines), the RUL is set by subtracting the time elapsed since the previous time point from the predicted RUL of the previous time point.
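A compact sketch of this cascade, written against the description above (threshold of 0.5, smoothing over five segments, monotonically decreasing RUL), is shown below. It is a reconstruction for illustration, with assumed variable names, and not the dissertation's RUN_TRAIN_VALIDATION.m code.

```matlab
function rulPred = ensemble_rul(scores, segTime)
% scores  - S x 4 matrix of classifier scores per segment, ordered in time;
%           columns = [last half, last third, last quarter, last fifth]
% segTime - S x 1 elapsed "active time" (minutes) at each segment
threshold = 0.5;
scores    = movmean(scores, 5, 1);     % smooth over five consecutive segments
factors   = [1, 1/2, 1/3, 1/4];        % RUL = factor * elapsed time at each stage
stage     = 0;                         % number of quantile thresholds passed so far
rulPred   = nan(size(segTime));

for s = 1:numel(segTime)
    % Advance to the next stage only once all previous stages were reached.
    while stage < 4 && scores(s, stage + 1) >= threshold
        stage      = stage + 1;
        rulPred(s) = factors(stage) * segTime(s);
    end
    if stage == 0
        continue;                      % prediction has not started yet
    end
    if isnan(rulPred(s))
        % Intermediate point: previous prediction minus elapsed time.
        rulPred(s) = rulPred(s - 1) - (segTime(s) - segTime(s - 1));
    end
    if s > 1 && ~isnan(rulPred(s - 1))
        rulPred(s) = min(rulPred(s), rulPred(s - 1));  % enforce monotonic decrease
    end
end
end
```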
Prediction performances

After training the four quantile-based LSTMs on the training dataset, we applied the complete prediction pipeline to the four modules in the validation dataset. Encouragingly, we found that the error rates were consistently low, and almost monotonically decreasing when predictions were made later in the life of each module (Fig. 15A). In all but one of the modules in the validation dataset, the prediction at the half-life time point had less than a 20% error rate. In all modules in the validation dataset, the prediction in the last tenth of the module's life had less than 10% error. In particular, the predicted RUL for modules 2 and 4 in the validation set was consistently accurate throughout their lifetime, as demonstrated in Figure 15B, C.

Figure 15 Validation performance. (a) Bar plots showing the error fraction of the four validation modules, when evaluated at different time points in the life of the module (half-life, last 1/4, 1/6, 1/8, 1/10, 1/12, and 1/14). (b) and (c) demonstrate the true RUL (x-axes, red) vs. the predicted RUL (y-axes, blue) for two modules in the validation set (validation modules 2 and 4, respectively). The prediction begins around the half-life of each module.

After confirming that the performance of the integrated quantile-based LSTM models on the validation set was sufficiently high and robust (Fig. 15), we turned to test the performance of the approach when applied to the test set, which was left unseen throughout the entire training process. We find that the prediction performances on the test set were comparably high when evaluated on time points from the last quarter of the module's life. For all five modules in the test set, the error rate was less than 25% in the last quarter of the module's life, and for four out of five modules it was less than 20% at all subsequent evaluation points. Four of the five modules in the test set also show a consistent and substantial decrease in the error rate when nearing the end of life, eventually reaching less than a 10% error rate (Fig. 16). These results are particularly encouraging given the small size of the available dataset and the difficulty of the classification problem, where previous work (with different datasets) failed when testing on a new and unseen dataset, even when achieving high performance in training and validation [44]. The consistently low error rates when applied to the validation and left-out test sets demonstrate the robustness of this method even when trained to predict the RUL of modules with a wide variation in lifespan (90-1400 minutes). It is possible that, by converting the classification task into a globally defined problem that does not depend on the distribution of samples (recognizing quantiles in the life of modules), this method allows prediction from scratch, without the need to retrain the models to fit a new distribution of samples.

Figure 16 Test performance. (a) Bar plots showing the error fraction of the five test modules, when evaluated at different time points in the life of the module (half-life, last 1/4, 1/6, 1/8, 1/10, 1/12, and 1/14).

Comparison to previous work

The dataset that we used throughout this work was obtained and originally used by Celaya et al. [42]. In their study, Celaya et al. utilized only a subset of the modules (5 modules overall), characterized by a lifespan range (as defined by them) between 150 and 240 minutes. In their work, they left one module out of training as a test module (the module numbered 23). Therefore, to allow comparison with the study of Celaya et al., we ensured that module number 23 was assigned to the test set in our study as well (Table 9).
In their study, Celaya et al. defined the end of life of each module not by the failure time and the last point measured, but by the time when the module had reached a certain delta in the Rds(ON) measurement (specifically, 0.05 Ω). In contrast, in this study we aim to predict the precise end of life of each module. To compare our results with the results of Celaya et al., we examined the same interval relative to the end of life of module 23 (the last 88 minutes). In their study, Celaya et al. compared three methods for RUL prediction, including two model-based methods, the Extended Kalman Filter (EKF) and the Particle Filter (PF), and one data-driven method, Gaussian Process Regression (GPR). PF showed the best performance on the test module out of the three examined. We compared the prediction from the quantile-based LSTM ensemble on the same test module to the performance of the three methods as described by Celaya et al. Strikingly, we find that the quantile-based LSTM predictor achieved better performance than all three techniques, specifically when reaching the end of the module's life (Fig. 17, Table 10). This is especially notable because, in contrast to the methods applied by Celaya et al., the quantile-based LSTM predictor was trained on a set of modules with a wide range of lifespans (Table 9), thus addressing a more general classification task. These results further support the power of this method in accurately predicting the RUL even when trained and evaluated on small datasets of modules with a wide range of times to failure.

Figure 17 Comparison to previous work. RUL prediction performance assessment for module number 23, for GPR, EKF, and PF as described by Celaya et al. [22], and for the quantile-LSTM predictor.

RUL    Quantile-LSTM    GPR     EKF      PF
88     152.5069228      NaN     64.98    77.65
78     112.8094278      NaN     80.22    65.85
68     83.44517583      NaN     56.64    58.33
48     61.72442772      NaN     50.15    49.47
48     45.65758223      73.2    42.75    38.68
38     33.77293062      33.4    30.35    27.14
33     29.04669114      17.6    18.57    24.76
28     24.98184939      14.6    17.24    21.09
23     21.48584828      13.8    18.28    16.66
18     18.47908332      11.8    13.46    14.68

Table 10 RUL prediction results for GPR, EKF, PF, and the quantile-LSTM for the last 88 minutes in the lifetime of module number 23.

Project one code overview

First function, RUN_TRAIN_VALIDATION.m: this script takes the raw data and outputs a structure that contains the validation and training modules: index, scores, labels, and RUL.
Input: raw data
Output: RES (struct)
This function has five steps:

First step, step1_process_raw_data:
Input: raw data
Output: Data (struct)
This function goes through each module and all of its parts and uses process_raw_module, which synchronizes the steady-state and in situ measurements. The output is a table for each module that contains the following features:
1. Rds(ON) upper limit
2. Rds(ON) lower limit
3. In situ measurement time
4. Steady-state time
5. Supply voltage
6. Package temperature
7. Drain-to-source voltage
8. Drain current
9. Flange temperature

Second step, step2_add_rul:
Input: Data (struct)
Output: data26
This step takes the 42-module data structure from the previous step, filters out the modules that have issues, and adds the RUL as the 10th feature.

Third step, step3_convert_data_for_lstm:
Input: Data (the struct from the previous step); t1 and t2, which represent the range of modules we want to use for training (and validation); and splt, which is the quantile to mark as 1 (percentile of RUL).
Output: Samp, the training (and validation) dataset (each point is 100 time points in a module), where Samp.S2 is the training data and Samp.l is the labels; moduleN, the module index corresponding to each training point in Samp (from t1 to t2); and RUL, the corresponding RUL.
This function takes the modules we selected and breaks each module into segments of 100 time points with a 50-time-point overlap. It removes the steady-state and in situ times, puts the RUL in a separate field, and puts the label in the label struct. In the main script, the quantiles we wish to train for can be selected in line 12 under quants[].

Fourth step, trainLSTM1:
Input: Samp from the previous step, the number of features, the training modules, and 0 (do not save; each network is saved later in the script).
Output: TNET, the trained neural network.
This step uses the training data to train a network; later in the script, each network (one per quantile) is saved. The function currently uses 5 hidden units and 50 epochs.

Fifth step, testLSTM1:
Input: the TNET from the previous step, the training data, and its labels.
Output: YPred, scores, L1, and L2.
The output is the scores and labels of the data. Each step adds more data to the RES structure, which contains the quantiles, module indices, training and validation module numbers, networks, labels, scores, and RUL.

Project 1 pseudocode

Begin function step1_process_raw_data
    For m in modules
        For the length of each module's steady-state time points
            Convert and insert time point from epoch to date and time
            Insert supply voltage
            Insert package temp
            Insert drain-source voltage
            Insert drain current
            Insert flange temp
        End
        For the length of each module's transient time points
            Convert and insert time point from epoch to date and time
            Calculate Rdson at each transient time point
            Insert the 0.25 and 0.75 quantiles of Rdson
        End
        For the length of each module's transient time points
            Match each transient time point with the closest steady-state time point
            Insert supply voltage
            Insert package temp
            Insert drain-source voltage
            Insert drain current
            Insert flange temp
        End
    End
End function

Begin function step2_add_rul
    For m in modules
        Remove time gaps
        Calculate and insert RUL
    End
End

Begin function step3_convert_data_for_lstm
    For m in modules
        Remove times
        Add labels
    End
    For tm in training and validation modules
        For k from 1 until module length minus 100, with 50-point steps
            Insert labels into the label structure
            Remove the label from the data
            Insert data into the structure
        End
    End
End

Begin function trainLSTM1
    Insert L2 with the converted categorical cell as a matrix of labels
    Set max epochs to 50
    Set the mini-batch size to 27
    Set the input size to the size of the training points
    Set the number of hidden units to 5
    Set numclass to the length of L2
    Set layers as
        Sequence input layer (input size)
        LSTM layer (number of hidden units, output mode, last)
        Fully connected layer (numclass)
        Softmax layer
        Classification layer
    Set options as
        Training options (Adam, execution environment, cpu, max epochs, mini-batch size, gradient threshold, 1, verbose, 0, plots, training progress)
    TNET is the network trained with the data, L2, layers, and options
End

Begin function testLSTM1
    Set L1 as the conversion of the labels
    Calculate the prediction and scores using the trained network and the data
End
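A driver sketch showing how these five steps could be chained, one quantile classifier at a time, is shown below. The function names follow the overview above, but their exact signatures and the RES field names are assumptions made for illustration, not the dissertation's actual code.

```matlab
% Sketch of the RUN_TRAIN_VALIDATION flow, under assumed function signatures.
quants = [0.5, 1/3, 0.25, 0.2];                 % quantiles to train for (quants[])
Data   = step1_process_raw_data(rawData);       % synchronize steady-state / in situ data
Data   = step2_add_rul(Data);                   % filter faulty modules, add RUL feature

RES        = struct();
RES.quants = quants;
for q = 1:numel(quants)
    % Segment modules 1-20 (training + validation) and label by this quantile.
    [Samp, moduleN, RUL] = step3_convert_data_for_lstm(Data, 1, 20, quants(q));
    TNET                 = trainLSTM1(Samp, 7, 1:16, 0);    % train on the 16 training modules
    [YPred, scores]      = testLSTM1(TNET, Samp, Samp.l);   % score every segment

    RES.nets{q}    = TNET;
    RES.scores{q}  = scores;
    RES.labels{q}  = YPred;
    RES.moduleN{q} = moduleN;
    RES.RUL{q}     = RUL;
end
```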
Project two: Quantile-based LSTM Remaining Useful Life prediction of filtration system.

Dataset 2

This dataset was published at the fifth European Conference of the Prognostics and Health Management Society (PHME 2020), as part of the PHME 2020 data challenge. The dataset was collected using the following experiment:

An experimental rig was constructed to demonstrate a clogging failure in a filter. It was designed using the following components: pump, liquid tanks, tank stirrer, pressure and flow rate sensors, pulsation dampener, filter, and data acquisition system. The experimental rig contains a circuit with a pump that pushes a liquid through a filter from one tank to another. The circuit monitors the flow rate and the liquid pressure before and after the filter using sensors. The fluid injected into the system is a suspension composed of polyetheretherketone (PEEK) particles and water, at different concentrations. The circuit includes a dampener to eliminate possible pulsations in the flow. Figure 18 depicts the employed experimental rig [67].

Figure 18 System of the experimental rig [67].

The main components of this experimental rig are the following:

Pump: a peristaltic pump was used since the system involves contaminants in the fluid, and its mechanism is more tolerant of particles in the liquid. The peristaltic pump, a Masterflex SN-77921-70 (Drive: 07523-80, Two Heads: 77200-62, Tubing: L/S 24), was installed in the system to maintain the flow of the prepared suspension. The pump provides a flow rate ranging from 0.28 to 1700 ml/min (i.e., from 0.1 to 600 RPM).

Dampener: to prevent unwanted tube expansion due to pressure build-up, which affects the actual pressure build-up generated from filter clogging, rigid tubing was used. A Masterflex pulse dampener was installed on the downstream side of the pump to eliminate any pulsation in the flow. The pump side is covered with flexible Tygon LFL pump tubing, and the majority of the system is furnished with rigid polypropylene tubing.

Particles: the suspension is composed of water and polyetheretherketone (PEEK) particles. PEEK particles have a significantly low water absorption level (0.1% / 24 hours, ASTM D570) and a density (1.3 g/cm3) close to that of room-temperature water. The low water absorption prevents particle expansion when the particles mix with water, and the similar density allows the particles to remain suspended longer in the water.

Flow rate sensor: to keep track of the flow rate in the system, a GMAG100 series electromagnetic flowmeter is installed, which has a measurement range from 3 to 25,000 milliliters per minute.

Pressure sensors: to capture the pressure drop (ΔP) across the filter, which is considered the main indicator of clogging, upstream and downstream Ashcroft G2 pressure transducers were installed, with a measurement range from 0 to 100 psi.

Filter: the filter has a pore mesh size of 125 µm, as shown in Figure 19.

Figure 19 Filter under study

Several experiments were run with this experimental rig, with suspensions of different concentrations and particle sizes. For each particle size, four experiments were performed with the concentrations reported in Table 11.

Profile number                        1        2         3        4
Water (g) in the suspension tank      7968     7497      7079     6704
Particles (g)                         32       32        32       32
Solid ratio                           0.004    0.00425   0.0045   0.00475
                                      (0.40%)  (0.43%)   (0.45%)  (0.48%)

Table 11 Suspension profile details

We added particles of three possible sizes (small, medium, and large) to create the suspensions. Details about the particle sizes are reported in Table 12.
Particle    Size (µm)
Small       45-53
Medium      53-63
Large       63-75

Table 12 Particle size data

The dataset collected from the experiment contains the liquid flow rate and the pressure before and after the filter (upstream and downstream), sampled at a 10 Hz rate. Any pressure drop higher than 20 psi was identified as a failure due to filter clogging; the pressure drop was computed as the difference between the upstream and downstream pressures. The dataset was divided into three sets: training, validation, and test.

The training dataset contained 24 run-to-failure experiments of the filter, with small or large particle sizes and different particle concentrations. Four experiments were done for each particle size and concentration combination. The following table describes the set.

Profile number    Particle size (µm)    Solid ratio (%)    Sample size
1                 45-53 (Small)         0.4                4
2                 45-53 (Small)         0.425              4
3                 45-53 (Small)         0.45               4
1                 63-75 (Large)         0.4                4
2                 63-75 (Large)         0.425              4
3                 63-75 (Large)         0.45               4

Table 13 Training set - 24 samples

The validation dataset contained 8 run-to-failure experiments of the filter, with small or large particle sizes and a constant particle concentration. Four experiments were done for each particle size. The following table describes the set.

Profile number    Particle size (µm)    Solid ratio (%)    Sample size
1                 45-53 (Small)         0.475              4
2                 63-75 (Large)         0.475              4

Table 14 Validation set - 8 samples

The test dataset contained 16 run-to-failure experiments of the filter, with only medium particle sizes and different particle concentrations. Four experiments were done for each particle concentration. The following table describes the set.

Profile number    Particle size (µm)    Solid ratio (%)    Sample size
1                 53-63 (Medium)        0.4                4
2                 53-63 (Medium)        0.425              4
3                 53-63 (Medium)        0.45               4
4                 53-63 (Medium)        0.475              4

Table 15 Test set - 16 samples

Procedure

Data representation

The dataset was downloaded in .csv format and imported into MATLAB. After it was imported, it was converted from a table format to an array, which included the following features:
• Time in seconds, with 0.1-second constant intervals.
• Flow rate in ml/min.
• Upstream pressure in psi.
• Downstream pressure in psi.
• Particle size in microns and solid ratio percentage, which were given in separate .xlsx files and were added manually.
• Pressure drop, calculated as the difference between the upstream and downstream pressures.
The dataset provider defined a clogged filter as a filter with a 20 psi pressure drop; therefore, we only used the data up to that point.
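A minimal sketch of this preprocessing for a single run is given below. The file name and table column names (Time, FlowRate, UpstreamPressure, DownstreamPressure) are hypothetical, the particle size and solid ratio are assumed to be scalars read from the separate .xlsx files, and the pressure drop is taken here as upstream minus downstream (the conventional sign for a drop across the filter).

```matlab
% Sketch: assemble the feature array for one filtration run and truncate at clogging.
T = readtable('run_01.csv');                       % hypothetical file name
X = table2array(T(:, {'Time', 'FlowRate', 'UpstreamPressure', 'DownstreamPressure'}));

dP = X(:, 3) - X(:, 4);                            % pressure drop across the filter (psi)
X  = [X, dP, repmat([particleSize, solidRatio], size(X, 1), 1)];   % append profile info

clogIdx = find(dP >= 20, 1, 'first');              % a 20 psi drop defines filter clogging
if ~isempty(clogIdx)
    X = X(1:clogIdx, :);                           % keep data only up to the clogging point
end
```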
The lifespan of the filters varies over a wide range, from 172.6 to 277.5 seconds. The list of filter lifespans can be seen in the following table:

Filter number    Set assignment    Lifetime (seconds)
1                Training          273.4
2                Training          277.5
3                Training          273.5
4                Training          256.1
5                Training          265.5
6                Training          266.2
7                Training          234.8
8                Training          238.1
9                Training          235.2
10               Training          212.2
11               Training          212.9
12               Training          210.8
13               Training          213.4
14               Training          206
15               Training          203.5
16               Training          192.8
17               Training          194.6
18               Training          193.1
19               Training          178
20               Training          175.6
21               Training          175.2
22               Training          172.6
23               Training          176
24               Training          173.2
25               Validation        273.1
26               Validation        261.1
27               Validation        234.2
28               Validation        210.9
29               Validation        204.7
30               Validation        198.8
31               Validation        182.1
32               Validation        173.2
33               Test              221.5
34               Test              223.7
35               Test              224.7
36               Test              226.1
37               Test              213
38               Test              215.5
39               Test              215.3
40               Test              210.8
41               Test              205.7
42               Test              206.1
43               Test              206.8
44               Test              208.9
45               Test              196.8
46               Test              198.9
47               Test              200.4
48               Test              198.3

Table 16 Filter set assignments and lifespans

After creating the feature table, we added the labels. The data was then converted into data points that each contain five time steps (half a second), and the label of each data point was the rounded-up average of its labels. The validation data was then set aside, and the training dataset was used to train the LSTM networks. In this work, five classifiers were created: the last 50%, 40%, 30%, 20%, and last 10% of life. After training, the classifiers were used to evaluate the training and validation datasets, and eventually the test dataset. This procedure was performed five times, once for each of the classifiers. After gathering all the scores given by the classifiers, an RUL estimator was designed to convert the scores into an RUL estimation.

RUL prediction

After the training process, we have five classifiers, each of which produces a score between zero and one for each data point when the network is applied to the data, where a higher score represents a higher probability of not yet having passed the threshold of that classifier (10%-50%). After combining all the scores, we obtain a matrix of size M_i x 5, where M_i is the number of data points for each filter, and each data point is equivalent to 0.5 seconds. To convert this matrix into an actual RUL, we developed an ensemble technique similar to the one used for the first dataset. This ensemble method has five steps. First, we use the first classifier we trained to predict the point at which the filter reaches 50% of its life: we scan the data point scores until the score of the first classifier drops below our set threshold of 0.8; when we reach that point, we set our initial RUL prediction to the time that has already passed, as we predict that half of the filter's life has passed. Only after reaching the first threshold do we start looking at the second classifier's scores; when that score drops below 0.8, we multiply the time that has passed by 2/3, since we believe 60% of the life has passed and we predict that 40% is left (60% x 2/3 = 40%). After reaching the second classifier's threshold, we move to the next classifier, using the same threshold of 0.8; when that point is reached, we multiply the time that has passed by 3/7, as we predict that 70% of the life has passed and 30% is left. The fourth and fifth steps use the same threshold of 0.8 and multiply the elapsed time by 1/4 and 1/9, respectively, to complete the prediction of the RUL.
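A sketch of this five-step estimator is given below, assuming that the score matrix is ordered in time and that, between thresholds, the prediction counts down with elapsed time as in the MOSFET ensemble; the function and variable names are illustrative.

```matlab
function rulPred = filter_ensemble_rul(scores, dt)
% scores - M x 5 matrix of classifier scores per data point, ordered in time;
%          columns = [last 50%, 40%, 30%, 20%, 10% of life]. A high score means
%          the quantile has NOT yet been passed, so passing is score < 0.8.
% dt     - time per data point in seconds (0.5 s here)
threshold = 0.8;
factors   = [1, 2/3, 3/7, 1/4, 1/9];   % remaining life as a fraction of elapsed time
stage     = 0;
M         = size(scores, 1);
rulPred   = nan(M, 1);

for m = 1:M
    elapsed = m * dt;
    % Move to the next stage only once all earlier thresholds were reached.
    while stage < 5 && scores(m, stage + 1) < threshold
        stage      = stage + 1;
        rulPred(m) = factors(stage) * elapsed;
    end
    if stage == 0
        continue;                      % prediction has not started yet
    end
    if isnan(rulPred(m))
        rulPred(m) = rulPred(m - 1) - dt;   % count down between thresholds
    end
end
end
```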
Prediction performances

In the following section, the prediction results for each filter's RUL are presented. In the small graphs, the red line represents the prediction; the first change in the red line shows the start of the prediction, while each subsequent change in the line represents the next classifier's prediction, and the blue line represents the true RUL. The x-axis shows time, where each time point is worth 0.5 seconds of data, and the y-axis is the RUL in seconds. The final graph in each section shows the full-length RUL prediction given by each classifier; every classifier is shown in a different color, and a dark blue line shows the true RUL. In that graph, the x-axis represents the filter number and the y-axis shows the RUL in seconds.

Training and validation datasets

The following figures show the prediction results on the training and validation datasets. The first change in the red line shows the start of the prediction, and we can see that in most cases we underpredict the true RUL, which can help prevent an unforeseen failure of the system.

Figure 20 Filters 1-8 from the training dataset

Figure 21 Filters 9-16 from the training dataset

Figure 22 Filters 17-24 from the training dataset

Figure 23 Filters 1-8 from the validation dataset

Figure 24 Training + Validation

We can also see that the prediction improves as we get closer to the end of life of the filter; the improved prediction by the later classifiers ensures higher accuracy of the RUL as we approach the end of life. The graphs also show that when the RUL is over-predicted, the next classifier will most likely mitigate the error. These performances looked promising, and after seeing that the overall accuracy levels remained consistent, the mean absolute error (MAE) and the percentage error were calculated for each classifier, as shown in Table 17.

MAE = \frac{1}{32} \sum_{i=1}^{32} |p_i - t_i|

In addition, the percentage error was calculated using the following formula:

M = \frac{100}{32} \sum_{i=1}^{32} \left| \frac{p_i - t_i}{t_i} \right|

Where:
p_i is the RUL prediction.
t_i is the true RUL value.

Classifier    MAE (seconds)    M (%)
50%           10.27            8
40%           9.61             2.74
30%           7.73             0.38
20%           3.01             1.22
10%           1.71             0.3

Table 17 MAE and M per classifier for the training and validation datasets combined

The results show that the classifiers are accurate in predicting the RUL. They also show the improvement in predicting the RUL when getting closer to the end of life of the filter. After reviewing the results of the training and validation, we proceeded to apply the same classifiers and ensemble procedure to the test dataset.
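Given per-classifier vectors of predicted and true RUL values over the 32 training and validation filters, the two error measures above can be computed as in this short sketch (variable names are illustrative):

```matlab
% Sketch: MAE and mean absolute percentage error (M) for one classifier.
% p - predicted RUL at this classifier's threshold for each of the 32 filters
% t - true RUL at the same points
mae = mean(abs(p - t));                 % MAE in seconds
M   = 100 * mean(abs((p - t) ./ t));    % percentage error
```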
Test dataset

The following figures show the prediction results on the test dataset. The first change in the red line shows the start of the prediction, and we can see that, similar to the training and validation datasets, in most cases we underpredict the true RUL at first. Unlike the training and validation datasets, we over-predict with the second classifier in most cases, but the following classifiers adjust the prediction and improve its overall accuracy.

Figure 25 Filters 1-8 from the test dataset

Figure 26 Filters 9-16 from the test dataset

Figure 27 Test dataset

We can also see that, similar to the training and validation, the prediction improves as we get closer to the end of life of the filter. Differently from the training and validation, the first classifier, last 50% of life, is very accurate on the test dataset, with an MAE of 4.74 seconds. In most cases the second classifier, last 40% of life, is less accurate, but, as in the training and validation, the following classifiers will most likely mitigate the error. The following table shows the accuracy of the different classifiers on the test dataset.

Classifier    MAE (seconds)    M (%)
50%           4.74             3.06
40%           14.4             11.13
30%           11.2             7.6
20%           3.28             1.65
10%           2.5              0.93

Table 18 MAE and M per classifier for the test dataset

Conclusions

Recurrent neural networks, and particularly gated recurrent neural networks such as the LSTM, represent the state of the art in learning from time-series data and in time-series classification tasks. However, in the field of prognostics, the shortage of reliable data measuring operational features and the respective RUL at each time point presents a considerable challenge, hindering the straightforward application of LSTMs to predict the RUL given a set of measurements (Fig. 12). In this work, I develop an approach that can overcome these limitations by representing the classification problem (RUL prediction given a set of measurements) as a set of simpler classification problems, for all of which we can achieve a high level of performance using LSTM recurrent neural networks. By converting the classification problem of "what is the module RUL at time point t?", which could not be answered by the LSTM, into a set of questions, "has the module passed half of its life?", "has the module passed 2/3 of its life?", "has the module passed 3/4 of its life?", and so on, which are successfully answered by the LSTM, we can eventually answer the original question, "what is the RUL?".

There are three novel components to this work: (1) the representation of the data: the features that are used and represented to eventually maximize performance and increase the number of time points while allowing the classifier to learn. (2) The proof of concept that, while the classification task of "what is the RUL?" given a set of measurements is difficult and cannot be solved with confidence, there is a set of classification tasks that can be solved with confidence given a set of measurements, such as "has the module passed half of its life?", and these are solved with high levels of accuracy by LSTM recurrent neural networks. (3) Finally, I developed an ensemble pipeline for the integration of the different quantile-based LSTM classifiers, which allows the prediction of the precise RUL at different points in the life of the module.

Overall, this work facilitates the utilization of LSTM RNNs, the state of the art in time-series prediction, for accurate prediction of the RUL for MOSFET modules and a filtration system with a wide lifespan range, when trained on a small dataset, thus allowing a substantial improvement in data-driven RUL prediction. Future work is warranted to further study the application of this technique to different classification problems in reliability that involve time-series data.

Contributions

• Developed a new method that outperforms a standard RNN LSTM in predicting the RUL.
• Established a robust and more accurate method to determine the RUL from a run-to-failure dataset, when given a small dataset, using an RNN LSTM.
• A new method of data representation, data blocks, which increases the number of data points and their significance.
• Using quantiles as thresholds to make the method more robust.
• The ability to use data with a wide range of lifespans and utilize it to predict the RUL.
• A method that can be applied to any run-to-failure dataset, not just electronics.

Future work
• Identifying the type of failure in addition to the RUL.
• Adding confidence intervals and a probability function to the RUL prediction.
• Using real-time maintenance data to determine the contribution of maintenance and its increase to the RUL.

Citations

[1] A. K. Schömig and O. Rose, "On the suitability of the Weibull distribution for the approximation of machine failures," in IIE Annual Conference. Proceedings, 2003, p. 1: Institute of Industrial and Systems Engineers (IISE).
[2] A. Heng, A. C. Tan, J. Mathew, N. Montgomery, D. Banjevic, and A. K. Jardine, "Intelligent condition-based prediction of machinery reliability," Mechanical Systems and Signal Processing, vol. 23, no. 5, pp. 1600-1614, 2009.
[3] D. Banjevic and A. Jardine, "Calculation of reliability function and remaining useful life for a Markov failure time process," IMA Journal of Management Mathematics, vol. 17, no. 2, pp. 115-130, 2006.
[4] E. Bechhoefer, "A method for generalized prognostics of a component using Paris law," in Annual Forum Proceedings-American Helicopter Society, 2008, vol. 64, no. 2, p. 1460: American Helicopter Society, Inc.
[5] F. Zhao, Z. Tian, and Y. Zeng, "Uncertainty quantification in gear remaining useful life prediction through an integrated prognostics method," IEEE Transactions on Reliability, vol. 62, no. 1, pp. 146-159, 2013.
[6] M. Behzad, H. A. Arghan, A. R. Bastami, and M. J. Zuo, "Prognostics of rolling element bearings with the combination of Paris law and reliability method," in 2017 Prognostics and System Health Management Conference (PHM-Harbin), 2017, pp. 1-6: IEEE.
[7] T. Wang, Z. Liu, M. Liao, and N. Mrad, "Life prediction for aircraft structure based on Bayesian inference: towards a digital twin ecosystem," in Annual Conference of the PHM Society, 2020, vol. 12, no. 1, pp. 8-8.
[8] W. K. Yu and T. A. Harris, "A new stress-based fatigue life model for ball bearings," Tribology Transactions, vol. 44, no. 1, pp. 11-18, 2001.
[9] M. Gašperin, Đ. Juričić, P. Boškoski, and J. Vižintin, "Model-based prognostics of gear health using stochastic dynamical models," Mechanical Systems and Signal Processing, vol. 25, no. 2, pp. 537-548, 2011.
[10] J. R. Celaya, C. S. Kulkarni, G. Biswas, and K. Goebel, "Towards a model-based prognostics methodology for electrolytic capacitors: A case study based on electrical overstress accelerated aging," Int. J. Prognostics Health Manage., vol. 3, no. 2, p. 33, 2012.
[11] D. A. Tobon-Mejia, K. Medjaher, N. Zerhouni, and G. Tripot, "A data-driven failure prognostics method based on mixture of Gaussians hidden Markov models," IEEE Transactions on Reliability, vol. 61, no. 2, pp. 491-503, 2012.
[12] H. Ocak, K. A. Loparo, and F. M. Discenzo, "Online tracking of bearing wear using wavelet packet decomposition and probabilistic modeling: A method for bearing prognostics," Journal of Sound and Vibration, vol. 302, no. 4-5, pp. 951-961, 2007.
[13] Y. Ao and G. Qiao, "Prognostics for drilling process with wavelet packet decomposition," The International Journal of Advanced Manufacturing Technology, vol. 50, no. 1-4, pp. 47-52, 2010.
[14] R. Huang, L. Xi, X. Li, C. R. Liu, H. Qiu, and J. Lee, "Residual life predictions for ball bearings based on self-organizing map and back propagation neural network methods," Mechanical Systems and Signal Processing, vol. 21, no. 1, pp. 193-207, 2007.
[15] P. Baraldi, M. Compare, S. Sauco, and E. Zio, "Ensemble neural network-based particle filtering for prognostics," Mechanical Systems and Signal Processing, vol. 41, no. 1-2, pp.
288-300, 2013.
[16] X. Li, Q. Ding, and J.-Q. Sun, "Remaining useful life estimation in prognostics using deep convolution neural networks," Reliability Engineering & System Safety, vol. 172, pp. 1-11, 2018.
[17] N. Khera and S. A. Khan, "Prognostics of aluminum electrolytic capacitors using artificial neural network approach," Microelectronics Reliability, vol. 81, pp. 328-336, 2018.
[18] H. Taghavifar and A. Mardani, "Applying a supervised ANN (artificial neural network) approach to the prognostication of driven wheel energy efficiency indices," Energy, vol. 68, pp. 651-657, 2014.
[19] S. Rebello, H. Yu, and L. Ma, "An integrated approach for system functional reliability assessment using Dynamic Bayesian Network and Hidden Markov Model," Reliability Engineering & System Safety, vol. 180, pp. 124-135, 2018.
[20] Q. Xiao, Y. Fang, Q. Liu, and S. Zhou, "Online machine health prognostics based on modified duration-dependent hidden semi-Markov model and high-order particle filtering," The International Journal of Advanced Manufacturing Technology, vol. 94, no. 1-4, pp. 1283-1297, 2018.
[21] J. Yu, "Adaptive hidden Markov model-based online learning framework for bearing faulty detection and performance degradation monitoring," Mechanical Systems and Signal Processing, vol. 83, pp. 149-162, 2017.
[22] F. Z. Feng, D. D. Zhu, P. C. Jiang, and H. Jiang, "GA-SVR based bearing condition degradation prediction," in Key Engineering Materials, 2009, vol. 413, pp. 431-437: Trans Tech Publ.
[23] H.-Z. Huang, H.-K. Wang, Y.-F. Li, L. Zhang, and Z. Liu, "Support vector machine based estimation of remaining useful life: current research status and future trends," Journal of Mechanical Science and Technology, vol. 29, no. 1, pp. 151-163, 2015.
[24] A. Nuhic, T. Terzimehic, T. Soczka-Guth, M. Buchholz, and K. Dietmayer, "Health diagnosis and remaining useful life prognostics of lithium-ion batteries using data-driven methods," Journal of Power Sources, vol. 239, pp. 680-688, 2013.
[25] H.-E. Kim, A. C. Tan, J. Mathew, E. Y. Kim, and B.-K. Choi, "Machine prognostics based on health state estimation using SVM," in Asset Condition, Information Systems and Decision Models: Springer, 2012, pp. 169-186.
[26] C. Cempel, H. Natke, and J. Yao, "Symptom reliability and hazard for systems condition monitoring," Mechanical Systems and Signal Processing, vol. 14, no. 3, pp. 495-505, 2000.
[27] C.-J. Du and D.-W. Sun, "Learning techniques used in computer vision for food quality evaluation: a review," Journal of Food Engineering, vol. 72, no. 1, pp. 39-55, 2006.
[28] A. Rastogi, R. Arora, and S. Sharma, "Leaf disease detection and grading using computer vision technology & fuzzy logic," in 2015 2nd International Conference on Signal Processing and Integrated Networks (SPIN), 2015, pp. 500-505: IEEE.
[29] A. K. Paul, D. Das, and M. M. Kamal, "Bangla speech recognition system using LPC and ANN," in 2009 Seventh International Conference on Advances in Pattern Recognition, 2009, pp. 171-174: IEEE.
[30] V. V. Krishnan, A. Jayakumar, and A. P. Babu, "Speech recognition of isolated Malayalam words using wavelet features and artificial neural network," in 4th IEEE International Symposium on Electronic Design, Test and Applications (DELTA 2008), 2008, pp. 240-243: IEEE.
[31] W. Gevaert, G. Tsenov, and V. Mladenov, "Neural networks used for speech recognition," Journal of Automatic Control, vol. 20, no. 1, pp. 1-7, 2010.
[32] R. Jafari-Marandi, S. Davarzani, M. S. Gharibdousti, and B. K.
Smith, "An optimum ANN-based breast cancer diagnosis: Bridging gaps between ANN learning and decision-making goals," Applied Soft Computing, vol. 72, pp. 108-120, 2018.
[33] O. W. Samuel, G. M. Asogbon, A. K. Sangaiah, P. Fang, and G. Li, "An integrated decision support system based on ANN and Fuzzy_AHP for heart failure risk prediction," Expert Systems with Applications, vol. 68, pp. 163-172, 2017.
[34] S. Vijayarani, S. Dhayanand, and M. Phil, "Kidney disease prediction using SVM and ANN algorithms," International Journal of Computing and Business Research (IJCBR), vol. 6, no. 2, 2015.
[35] J. T. Connor, R. D. Martin, and L. E. Atlas, "Recurrent neural networks and robust time series prediction," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 240-254, 1994.
[36] J. Zhang and K. Man, "Time series prediction using RNN in multi-dimension embedding phase space," in SMC'98 Conference Proceedings. 1998 IEEE International Conference on Systems, Man, and Cybernetics (Cat. No. 98CH36218), 1998, vol. 2, pp. 1868-1873: IEEE.
[37] Y. Qin, D. Song, H. Chen, W. Cheng, G. Jiang, and G. Cottrell, "A dual-stage attention-based recurrent neural network for time series prediction," arXiv preprint arXiv:1704.02971, 2017.
[38] X. Cai, N. Zhang, G. K. Venayagamoorthy, and D. C. Wunsch II, "Time series prediction with recurrent neural networks trained by a hybrid PSO-EA algorithm," Neurocomputing, vol. 70, no. 13-15, pp. 2342-2353, 2007.
[39] P. Malhotra, L. Vig, G. Shroff, and P. Agarwal, "Long short term memory networks for anomaly detection in time series," in Proceedings, 2015, vol. 89: Presses universitaires de Louvain.
[40] X. Ma, Z. Tao, Y. Wang, H. Yu, and Y. Wang, "Long short-term memory neural network for traffic speed prediction using remote microwave sensor data," Transportation Research Part C: Emerging Technologies, vol. 54, pp. 187-197, 2015.
[41] T. Fischer and C. Krauss, "Deep learning with long short-term memory networks for financial market predictions," European Journal of Operational Research, vol. 270, no. 2, pp. 654-669, 2018.
[42] J. Celaya, A. Saxena, S. Saha, and K. F. Goebel, "Prognostics of power MOSFETs under thermal stress accelerated aging using data-driven and model-based methodologies," 2011.
[43] L. Guo, N. Li, F. Jia, Y. Lei, and J. Lin, "A recurrent neural network based health indicator for remaining useful life prediction of bearings," Neurocomputing, vol. 240, pp. 98-109, 2017.
[44] K. Pugalenthi, H. Park, and N. Raghavan, "Prognosis of power MOSFET resistance degradation trend using artificial neural network approach," Microelectronics Reliability, vol. 100, p. 113467, 2019.
[45] Y. Zhang, R. Xiong, H. He, and M. G. Pecht, "Long short-term memory recurrent neural network for remaining useful life prediction of lithium-ion batteries," IEEE Transactions on Vehicular Technology, vol. 67, no. 7, pp. 5695-5705, 2018.
[46] Z. Shi and A. Chehade, "A dual-LSTM framework combining change point detection and remaining useful life prediction," Reliability Engineering & System Safety, vol. 205, p. 107257, 2021.
[47] R. Guo, Y. Wang, H. Zhang, and G. Zhang, "Remaining useful life prediction for rolling bearings using EMD-RISI-LSTM," IEEE Transactions on Instrumentation and Measurement, vol. 70, pp. 1-12, 2021.
[48] K. Goebel, B. Saha, and A. Saxena, "A comparison of three data-driven techniques for prognostics," in 62nd Meeting of the Society for Machinery Failure Prevention Technology (MFPT), 2008, pp. 119-131.
[49] C. K. Williams and C. E.
Rasmussen, Gaussian Processes for Machine Learning (no. 3). MIT Press, Cambridge, MA, 2006.
[50] T. Mikolov, A. Deoras, D. Povey, L. Burget, and J. Černocký, "Strategies for training large scale neural network language models," in 2011 IEEE Workshop on Automatic Speech Recognition & Understanding, 2011, pp. 196-201: IEEE.
[51] G. Hinton et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82-97, 2012.
[52] A. Graves, A.-r. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 6645-6649: IEEE.
[53] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097-1105.
[54] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, "Learning hierarchical features for scene labeling," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1915-1929, 2012.
[55] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler, "Joint training of a convolutional network and a graphical model for human pose estimation," in Advances in Neural Information Processing Systems, 2014, pp. 1799-1807.
[56] C. Szegedy et al., "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1-9.
[57] N. Auslander, Y. I. Wolf, and E. V. Koonin, "In silico learning of tumor evolution through mutational time series," Proceedings of the National Academy of Sciences, vol. 116, no. 19, pp. 9501-9510, 2019.
[58] P. Werbos, "New Tools for Prediction and Analysis in the Behavioral Sciences," Ph.D. dissertation, Harvard University, 1974.
[59] D. Parker and L. Logic, "Technical Report TR-47," Massachusetts Institute of Technology, 1985.
[60] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, pp. 533-536, 1986.
[61] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems, 2013, pp. 3111-3119.
[62] I. Sutskever, J. Martens, and G. E. Hinton, "Generating text with recurrent neural networks," in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 1017-1024.
[63] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[64] S. Saha, J. R. Celaya, V. Vashchenko, S. Mahiuddin, and K. F. Goebel, "Accelerated aging with electrical overstress and prognostics for power MOSFETs," in IEEE 2011 EnergyTech, 2011, pp. 1-6: IEEE.
[65] Y. Umuroglu et al., "Finn: A framework for fast, scalable binarized neural network inference," in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2017, pp. 65-74.
[66] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[67] Fifth European Conference of the Prognostics and Health Management Society 2020, "PHME Data Challenge 2020," 2020. Available: phmeurope.org/2020/data-challenge-2020