ABSTRACT 
 
 
 
 
Title of Dissertation: MACHINERY ANOMALY DETECTION 
UNDER INDETERMINATE OPERATING 
CONDITIONS  
  
 Jing Tian, Doctor of Philosophy, 2018 
  
Dissertation directed by: George E. Dieter Professor, Michael Pecht, 
Department of Mechanical Engineering 
 
 
Anomaly detection is a critical task in system health monitoring. Current practice of 
anomaly detection in machinery systems is still unsatisfactory. One issue is with the 
use of features. Some features are insensitive to the change of health, and some are 
redundant with each other. These insensitive and redundant features in the data 
mislead the detection. Another issue is from the influence of operating conditions, 
where a change in operating conditions can be mistakenly detected as an anomalous 
state of the system. Operating conditions are usually changing, and they may not be 
readily identified. They contribute to false positive detection either from non-
predictive features driven by operating conditions, or from influencing predictive 
features. This dissertation contributes to the reduction of false detection by 
developing methods to select predictive features and use them to span a space for 
anomaly detection under indeterminate operating conditions. 
 
Available feature selection methods fail to provide consistent results when some 
features are correlated. A method was developed in this dissertation to explore the 
correlation structure of features and group correlated features into the same clusters. 
A representative feature from each cluster is selected to form a non-correlated set of 
features, where an optimized subset of predictive features is selected. After feature 
selection, the influence of operating conditions through non-predictive variables are 
removed. To remove the influence on predictive features, a clustering-based anomaly 
detection method is developed. Observations are collected when the system is healthy, 
and these observations are grouped into clusters corresponding to the states of 
operating conditions with automatic estimation of clustering parameters. Anomalies 
are detected if the test data are not members of the clusters. Correct partitioning of 
clusters is an open challenge due to the lack of research on the clustering of the 
machinery health monitoring data. This dissertation uses unimodality of the data as a 
  
criterion for clustering validation, and a unimodality-based clustering method is 
developed.  
 
Methods of this dissertation were evaluated by simulated data, benchmark data, 
experimental study and field data. These methods provide consistent results and 
outperform representatives of available methods. Although the focus of this 
dissertation is on the application of machinery systems, the methods developed in this 
dissertation can be adapted for other application scenarios for anomaly detection, 
feature selection, and clustering. 
 
 
 
 
 
 
  
 
 
 
 
 
 
 
 
 
MACHINERY ANOMALY DETECTION UNDER INDETERMINATE 
OPERATING CONDITIONS   
 
 
 
by 
 
 
Jing Tian 
 
 
 
 
 
Dissertation submitted to the Faculty of the Graduate School of the  
University of Maryland, College Park, in partial fulfillment 
of the requirements for the degree of 
Doctor of Philosophy 
2018 
 
 
 
 
 
 
Advisory Committee: 
Professor Michael Pecht, Chair 
Dr. Michael H. Azarian 
Professor Abhijit Dasgupta 
Professor F. Patrick McCluskey 
Professor Peter Sandborn 
Professor Yunfeng Zhang 
 
 
 
  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
© Copyright by 
Jing Tian 
2018 
 
 
 
 
 
 
 
 
 
 
 
 
 
  
 
Acknowledgements 
I would like to thank my advisor Prof. Michael Pecht. His guidance and training let 
me grow to a critical thinker in research and an enlightened person in life. I would 
like to express my gratitude to my co-advisor Dr. Michael H. Azarian. Without his 
sacrifice of time and patient guidance I would never master the skills and techniques 
to complete my research. I am very grateful to my committee members, Prof. Abhijit 
Dasgupta, Prof. F. Patrick McCluskey, Prof. Peter Sandborn, and Prof. Yunfeng 
Zhang for their sacrifice of time to help me to complete my dissertation. I would also 
like to express my appreciation to Dr. Michael Krein and Janet Wedgewood from 
Lockheed Martin, Dr. William Hardman and Dr. Jason Hines from NAVAIR for their 
support and inspiration. I would like to thank my colleagues at CALCE. Especially, 
Dr. Carlos Morillo and Dr. Myeongsu Kang have given me constant help in solving 
problems. 
 
I owe a lot to my family. Thank you, dear mother, for your understanding. Thank you, 
dear father. The faulty bearings and rolling elements you tested were the earliest 
seeds of this dissertation. A special note to Petermao. Thank you for your heroism in 
the earthquake. 
 
 ii 
 
 
Table of Contents 
 
Acknowledgements ....................................................................................................... ii 
Table of Contents ......................................................................................................... iii 
List of Tables ................................................................................................................ v 
List of Figures .............................................................................................................. vi 
List of Abbreviations ................................................................................................. viii 
Chapter 1: Introduction ................................................................................................. 1 
1.1 Background ......................................................................................................... 2 
1.2 Motivation ........................................................................................................... 8 
1.3 Overview of the Dissertation .............................................................................. 9 
Chapter 2: Literature Review ...................................................................................... 10 
2.1 Issues Related to Features in Anomaly Detection ............................................ 10 
2.2 Survey of Feature Selection Methods ............................................................... 13 
2.2 Issues Related to Anomaly Detection Methods ................................................ 17 
2.4 Survey of Anomaly Detection Methods ........................................................... 19 
2.4.1 Survey of Supervised Anomaly Detection ................................................. 19 
2.4.2 Survey of Unsupervised Anomaly Detection ............................................ 26 
2.4.3 Survey of Semi-Supervised Anomaly Detection ....................................... 30 
2.5 Problem Statement and Objectives ................................................................... 33 
Chapter 3: Feature Selection ....................................................................................... 35 
3.1 Development of the Feature Selection Method ................................................ 35 
3.2 Evaluation of the Feature Selection Method ..................................................... 45 
3.2.1 Feature Selection Method Evaluation Using Simulated Data .................... 46 
3.2.2 Feature Selection Method Evaluation Using Benchmark Data ................. 50 
3.2.3 Feature Selection Method Evaluation Using Experimental Data .............. 52 
3.2.4 Field Study of the Feature Selection Method ............................................ 56 
3.3 Summary ........................................................................................................... 58 
Chapter 4: Anomaly Detection Using Unimodality-Based Clustering ....................... 59 
4.1 Development of Unimodality-Based Clustering ............................................... 61 
4.2 Evaluation of Unimodality-Based Clustering ................................................... 72 
4.3 Development of the Anomaly Detection Method Using Unimodality-Based 
Clustering ................................................................................................................ 78 
4.4 Evaluation of the Anomaly Detection Method ................................................. 84 
4.3.1 Evaluation of the Anomaly Detection Method Using Simulated Data ...... 84 
4.3.2 Evaluation of the Anomaly Detection Method Using Simulated Data ...... 86 
4.3.2 Evaluation of the Anomaly Detection Method Using Experimental Data 90 
4.4 Summary of Anomaly Detection ...................................................................... 97 
Chapter 5: Contributions and Future Work ................................................................ 98 
Bibliography ............................................................................................................. 102 
 
 
 
 
 iii 
 
 
 
 iv 
 
 
List of Tables 
 
Table 1: An Example of the Influence of Operating Conditions .......................... 9 
Table 2: Feature Selection Results of the Simulated Data .................................. 48 
Table 3: Results of the Wine Data ...................................................................... 50 
Table 4: Results of the Experimental Data ......................................................... 55 
Table 5: Results of the Field Data....................................................................... 57 
Table 6: Results of Simulated Data Anomaly Detection .................................... 86 
Table 7: Results of Iris Data Anomaly Detection ............................................... 87 
Table 8: Results of Wine Data Anomaly Detection............................................ 88 
Table 9: Results of Breast Cancer Anomaly Detection ...................................... 89 
Table 10: Results of Experimental Data Anomaly Detection ............................. 96 
 
 v 
 
 
List of Figures 
Figure 1: Wind Turbine Drivetrain ....................................................................... 2 
Figure 2: Detecting Bearing Fault Using Shape Factor (Left) and Crest Factor 
(Right) ......................................................................................................... 11 
Figure 3: The Influence of Noise and Redundant Features ................................ 13 
Figure 4: Influence of Rotation Speed (Left) on Vibration Signal (Right) ......... 18 
Figure 5: Patterns in the Feature Space............................................................... 19 
Figure 6: A General Boundary and a Detailed Boundary ................................... 31 
Figure 7: Flowchart of Selection......................................................................... 37 
Figure 8: Correlation Tree of the Iris Data ......................................................... 40 
Figure 9: Optimization of the Feature Selection ................................................. 43 
Figure 10: Useful Features of the Simulated Data .............................................. 46 
Figure 11: Correlation Tree of the Simulated Data ............................................ 47 
Figure 12: The Number of Selected Features as the Number of Noise and 
Redundant Features Increases ..................................................................... 49 
Figure 13: Mean Accuracy as the Number of Noise and Redundant Features 
Increases ...................................................................................................... 49 
Figure 14: Correlation Tree of the Wine Data .................................................... 50 
Figure 15: The Number of Selected Features as the Number of Noise and 
Redundant Features Increases ..................................................................... 51 
Figure 16: Mean Accuracy as the Number of Noise and Redundant Features 
Increases ...................................................................................................... 51 
Figure 17: Experiment Setup .............................................................................. 52 
Figure 18: Correlation Tree of the Experimental Data ....................................... 54 
Figure 19: Selected Features ............................................................................... 56 
Figure 20: An Excerpt of the Correlation Tree for the Aircraft Data ................. 57 
Figure 21: Procedure of Clustering-Based Anomaly Detection ......................... 61 
Figure 22: The Issue of Elbow Method .............................................................. 62 
Figure 23: Unimodality and Clusters .................................................................. 65 
Figure 24: Dip Statistic ....................................................................................... 68 
Figure 25: The Shape of Unimodal Distributions ............................................... 68 
Figure 26: Flowchart of Unimodality-Based Clustering .................................... 71 
Figure 27: Control Overlaps by Setting the Distance Between Centroids.......... 74 
Figure 28: Clustering Performance of Gaussian Clusters Using Unimodality-
Based Clustering (UC), Silhouette-Based Clustering (SC), and Gap-Based 
Clustering (GC) ........................................................................................... 74 
Figure 29: Control the Shape of Clusters ............................................................ 75 
Figure 30: Clustering Performance of Lognormal Clusters ................................ 76 
Figure 31: Clustering Gaussian Data of Different Dimensions .......................... 76 
Figure 32: Clustering non-Gaussian Data of Different Dimensions ................... 77 
Figure 33: Clustering-Based Anomaly Detection ............................................... 79 
Figure 34: Anomaly Detection Procedure Using Unimodality-Based Clustering.
..................................................................................................................... 80 
Figure 35: Using MD as Anomaly Indicator ...................................................... 81 
Figure 36: Determination of the Anomaly Threshold......................................... 83 
 vi 
 
 
Figure 37: Simulated Normal Reference Data and Test Data............................. 85 
Figure 38: Iris Data ............................................................................................. 87 
Figure 39: Wine Data .......................................................................................... 88 
Figure 40: Breast Cancer Data ............................................................................ 89 
Figure 41: Wind Turbine Simulator .................................................................... 91 
Figure 42: Pinion Faults ...................................................................................... 92 
Figure 43: De-Noising Result ............................................................................. 93 
Figure 44: Clustering Result of the Normal Reference Data .............................. 95 
Figure 45: Projecting Test Data to the Feature Space ......................................... 96 
 
 vii 
 
 
List of Abbreviations 
 
1-SVM One-Class Support Vector Machine 
ANN   Artificial Neural Network 
BPNN  Back Propagation Neural Networks 
CALCE Center for Advanced Life Cycle Engineering 
CBM  Condition-Based Maintenance  
cdf  Cumulative Distribution Function 
CTC  Correlation Tree-Cut 
DBN  Deep Belief Networks 
EEMD  Ensemble Empirical Mode Decomposition 
EMD  Empirical Mode Decomposition 
FMMEA  Failure Mode Mechanisms and Effect Analysis 
GA  Genetic Algorithm 
GC  Gap-Based Clustering 
GMM  Gaussian Mixture Model 
HDN  Hierarchical Diagnosis Network 
HMM  Hidden Markov Model 
ISVM-BT Improved Support Vector Machine-based Binary Tree 
KNN  K-Nearest Neighbor 
KS  Kolmogorov–Smirnov  
LDA   Linear Discriminant Analysis 
LR  Logistic Regression  
 viii 
 
 
MD  Mahalanobis Distance 
MLP  Multi-Layer Perceptron 
mRMR Max-Relevancy Min-Redundancy 
PCA  Principal Component Analysis 
pdf  Probability Density Function 
PHM   Prognostics and Health Management 
PoF  Physics-of-Failure  
PSO   Particle Swarm Optimization 
RBF  Radial Basis Function  
RBM  Restricted Boltzmann Machines 
RCF   Rolling Contact Fatigue 
RF  Random Forest  
RFE  Recursive Feature Elimination 
rms  Root-Mean-Square 
RUL   Remaining Useful Life  
SC  Silhouette-Based Clustering  
SI  Silhouette Index 
SOMs  Self-Organizing Maps 
SS  Stability Selection 
std  Standard Deviation 
SVM  Support Vector Machine 
TR-LDA Trace Ratio Linear Discriminant Analysis 
UC   Unimodality-Based Clustering 
 ix 
 
 
UCAD  Unimodality Clustering Anomaly Detection  
WPNE  Wavelet Packet Node Energy  
WPT  Wavelet Packet Transform 
 
 
 
 x 
 
 
Chapter 1: Introduction 
Machinery such as bearings, gears, and shafts are widely used in electromechanical 
systems including electric motors, generators, pumps, turbines, and fans. They usually 
play critical roles in the systems. For example, in a typical wind turbine, the kinetic 
energy of wind is captured by blades to rotate the main shaft, which is constrained to 
rotate in the desired direction by a main bearing. The rotation of the main shafts 
converted by a gearbox to get the desired rotation speed and torque to drive the 
generator, completing the conversion of kinetic energy of wind to the electrical energy. 
The drivetrain of a typical wind turbine is shown in Figure 1. A pump works on the 
opposite direction that the electrical energy is converted to the rotation of the pump 
shaft by an electric motor. The rotation of the shaft is constrained to the desired 
direction by pump bearings and it is converted to the desired rotation speed and torque 
by a gearbox. The output rotation drives the impeller to lift the liquid. In both 
examples, the failure of any bearing, shaft, or gear breaks the required energy 
transmission, resulting in system failures. These machinery components play similar 
critical roles in other systems as well, such as cooling fans, gas turbines, hydro 
turbines, steam turbines, vehicle powertrains, and machine tools. 
 
Although machinery components have been used in various industrial sectors since the 
first industrial revolution, their reliability has remained an issue of research focus due 
to multiple factors. One is the criticality of machinery failures, as machinery 
components failures often lead to system failures. Another factor is the frequency of 
 1 
 
 
machinery failures, which are widespread in some applications. For example, bearing 
failures account for more than 40% of the system failures of induction motors [2]. A 
third factor is the system downtime caused by the machinery failures. In wind turbines, 
gearbox failure is the top contributor of system downtime [3] due to the time spent on 
the diagnosis and replacement of the failed gearbox. 
 
main shaft
 
Figure 1: Wind Turbine Drivetrain [1] 
 
1.1 Background 
When a machinery component fails, corrective maintenance is carried out to detect the 
failed component and restore its reliability. However, because the maintenance is 
carried out after the occurrence of failure, it cannot avoid some failure consequences. 
Because the maintenance time is unexpected, logistics may not be ready. Since a 
machinery component usually takes time to degrade from a healthy state to failure, 
taking maintenance measures before the occurrence of failure is a desired strategy to 
avoid failures. Because the time of the failure is unknown, maintenance is planned to 
 2 
 
 
perform regularly, leading to the development of scheduled maintenance, which is also 
called planned preventive maintenance. Scheduled maintenance has become a standard 
strategy to maintain the machinery reliability in various industry sectors because of 
several advantages over corrective maintenance. First, it restores the reliability of 
machinery components before failures and thus the occurrence of disastrous failure 
consequences can be reduced. Second, it avoids the unscheduled downtime from 
failures. Third, because the maintenance is scheduled, logistics can be prepared in 
advance. However, there are several shortcomings of scheduled maintenance. First, if 
a fault initiated and developed to failure between two maintenances, scheduled 
maintenance cannot detect it. Second, compared with maintenance plans depend solely 
on corrective maintenance, scheduled maintenance avoids unscheduled downtime, but 
it requires more maintenances, which increase maintenance cost and require more 
downtime. Third, when a component is replaced, it may have a significant portion of 
remaining useful life (RUL) left, and thus increases the cost on unnecessary 
component replacement. Finally, scheduled maintenance often involves intrusive 
measurements, increasing the risk of causing damage during the maintenance.  
 
To overcome the shortcomings of the scheduled maintenance, the advent of failure 
should be predicted so the maintenance can be optimized accordingly. The prediction 
of failure needs in-situ monitoring, which becomes possible with the development of 
technologies such as sensing and computation. Based on in-situ monitoring, several 
maintenance strategies have been developed, including predictive maintenance and 
condition-based maintenance (CBM). They are improved further to form prognostics 
 3 
 
 
and health management (PHM), which is an enabling discipline consisting of 
technologies and methods to assess the reliability of a product in its actual life cycle 
conditions to determine the advent of failure and mitigate system risk [4]. PHM aims 
to detect, diagnose, and predict the onset and source of system degradation as well as 
the time to system failure. The goal is to make intelligent decisions about the system 
health and to arrive at strategic and business case decisions [5].The implementation of 
PHM has two major approaches: physics-of-failure (PoF) approach, and data driven 
approach. 
 
The PoF approach utilizes knowledge of hardware configurations and life cycle 
loading to predict the reliability and remaining useful life of the components [27]. 
The major inputs in respect of hardware configurations include material properties 
and machinery structure. The life cycle loading includes operational loads such as 
rotation speed, torque, duty cycles, and environmental loads include temperature, 
relative humidity, and pressure. PoF PHM involves conducting a failure mode 
mechanisms and effect analysis (FMMEA) to identify major failure mechanisms. The 
physics of failure models are developed for the identified failure mechanisms to 
establish functional relationships between the time to failure and various physical and 
stresses parameters. The application PoF PHM has following challenges. First, the 
development of physics of failure models is still an active research area that many 
failure mechanisms lack applicable models. Second, it is common that multiple 
failure mechanisms contribute to the failure together and the interaction between 
these failure mechanisms is often stochastic and is not modeled.  
 4 
 
 
 
Data-driven PHM is an alternative to avoid the challenges in PoF PHM. Data-driven 
PHM does not require physics of failure models for any failure mechanisms, and it 
does not need detailed understanding of failure mechanisms. However, compared 
with PoF PHM, it requires larger amount of data and higher capability of computation. 
With the development and popularization of data acquisition and computation 
technologies, these limits on the application of data-driven PHM has been alleviated. 
In data-driven PHM, data are acquired in-situ using a network of sensors that 
monitors the system. Features carrying health information of the system are extracted 
from the sensor signals through a series of procedures like noise reduction, 
normalization, and transformation. Health states of the system are then estimated 
based on the extracted features via decision making using methods such as machine 
learning techniques. Based on the use of historic data, machine learning techniques 
can be classified as supervised learning techniques, and unsupervised learning 
techniques. In supervised learning, historic data with system health labels are used to 
train an algorithm to establish regions of different health states of the system. Current 
health state of the system is determined by classifying the current data to one of the 
regions. Widely used supervised learning techniques include support vector machine 
(SVM), k-nearest neighbor, artificial neural network, deep learning classifiers and 
regressors, decision tree, and random forest. Unsupervised learning techniques do not 
need data with health labels. They explore the nature of the data from different 
aspects. For example, unsupervised learning uses clustering techniques to partition 
the health monitoring data to multiple clusters according to the density of the data, 
 5 
 
 
and the fault was detected by judging the density of the data. Widely used 
unsupervised learning techniques include k-means clustering, Gaussian mixture 
model, and self-organizing maps.  
 
In both PoF PHM and data-driven PHM, fault detection is a major task across a wide 
range of applications, such as induction motor rotor bars [6], bearings [7], and 
gears[8]. A machinery fault is an abnormal condition that leads to the failure of the 
machinery, which is a state in which the machinery cannot perform its required 
function under stated conditions. Commonly observed faults in machinery include 
pits, indents, and wear in bearings, pits, root crack, wear, and missing teeth in gears, 
and misalignment, wear, and bent in shafts. In most cases, when a fault emerges, the 
machine can still perform its required function until the fault develops to a certain 
degree. For example, a main failure mechanism of rolling element bearing is rolling 
contact fatigue (RCF), which happens even if the bearing is working under stated 
conditions and is lubricated and maintained properly. In RCF, cracks initiate beneath 
or on the contacting surfaces because of stress concentration around deficiencies or 
material impurities and the cyclic loading from the rolling elements. As the cracks 
propagate, some material is removed, forming pits on the surface. At an early stage, 
the bearing can still work as required at the presence of these faults, which are cracks 
and pits. With the development of the cracks and pits, more material is lost, and the 
working profile of the bearing is changed to a degree that the bearing fails to work as 
required. In general, there is a time gap between the emergence of fault and the 
occurrence of failure. If the fault can be detected at an incipient stage and if its 
 6 
 
 
development can be monitored, PHM can be performed to reduce the failure 
frequency and criticality as a result of optimized maintenance.  
 
Machinery fault detection based on PoF PHM monitors the variables of PoF models 
and compares their monitored value to the calculated value from the model. When the 
deviation of the monitored value from the variables exceeds a predetermined 
threshold, the fault is detected. PoF fault detection requires in-depth knowledge of 
machinery failure mechanisms or fault characteristic signal-generating mechanisms [9] 
to construct PoF models, limiting the application range. Therefore, a lot of methods 
were developed for data-driven fault detection, which avoids the limitations of the 
PoF approach. 
 
In data-driven PHM, fault detection is achieved by anomaly detection through 
learning the rules of detection from historical data [10]. Anomalies are patterns in the 
data that do not conform to a defined notion of normal behavior [11]. In machinery 
fault detection, the normal behavior is usually the distribution of healthy reference 
data, which are the health monitoring data collected when the machinery is healthy. 
When a system becomes faulty, the health monitoring data no longer conform to the 
normal behavior defined by the healthy reference data, and thus the behavior is 
considered anomalous. In sum, a fault is a physical state of the machinery, anomalies 
are the representations of the fault in the data, and anomaly detection is the process to 
identify the existence of fault through the detection of anomalies. 
 
 7 
 
 
1.2 Motivation 
Current practice of fault detection is not satisfactory that false detections, including 
both false positive detection and false negative detection, are causing losses. False 
positive detection is also termed type I error. It mistakenly regards healthy data as 
faulty, leading to unnecessary downtime and maintenance cost. False negative 
detection is also termed type II error. It mistakenly regards faulty data as healthy, and 
therefore it may leave catastrophic failures undetected. For example, in 2016, 
undetected gearbox bearing pitting and gear fatigue cracking led to the crash of an 
Airbus Helicopters H225, resulting in 13 deaths [12]. Therefore, it is necessary to 
investigate the problem and improve over available methods. 
 
Two issues contribute to the false detection of anomalies: first, the health monitoring 
data usually contain insensitive and redundant features, and they mislead the 
detection. Features are often extracted using engineering experience about the system. 
Some of them are not sensitive to the faults of specific systems, and thus insensitive 
features exist. Some features are driven by the same underlying factor and they are 
redundant with each other. Useless and redundant information from insensitive and 
redundant features can mislead the detection. Especially, some features are more 
sensitive to the change of operating conditions than the change of system health. 
When operating conditions are changing, false positive detection may occur. Second, 
machinery operating conditions are usually changing due to multiple operation 
regimes and environmental influence. The changing operating conditions are often 
indeterminate because the operating conditions are often not monitored, and the 
 8 
 
 
influence of some operating conditions is unknown. As a result, anomalous states can 
be confused with changed operating states. For example, driving on a road with 
unknown surface quality is a case of indeterminate operating condition. If vibration 
amplitude is used to monitor automobile engine health, a car with a faulty engine 
driving on a smooth road surface can be confused with a healthy engine when the car 
is driving on a rough road surface, as shown in Table 1. Both issues and available 
methods to address them need to be investigated. 
 
Table 1: An Example of the Influence of Operating Conditions 
 
 
1.3 Overview of the Dissertation 
The structure of the remaining dissertation is as follows. Chapter 2 discusses the 
issues related to features and anomaly detection methods and provides a literature 
review of available methods that address the issues. Based on the analysis of the 
literature, research gaps were identified, and objectives of the dissertation is proposed 
to fill the gaps. Chapter 3 introduces the development and evaluation of a feature 
selection method that works when insensitive and redundant features exist. Chapter 4 
introduces the development and evaluation of a clustering method and an anomaly 
detection method based on it that works under indeterminate operating conditions. 
Chapter 5 presents the contributions of the dissertation and suggested future work. 
 9 
 
 
 
Chapter 2: Literature Review 
Inappropriate set of features and incompetent anomaly detection methods are two 
major issues of failed anomaly detection. Therefore, these two issues were 
investigated and available methods to address them were reviewed in this chapter to 
identify research gaps. 
 
2.1 Issues Related to Features in Anomaly Detection 
In-situ monitoring signals themselves are often inadequate for fault detection so 
features are extracted from them to capture the existence of fault. Ranging from 2012 
to 2017, every year more than 20,000 papers about machinery fault features are 
published, estimated by searching using Google Scholar. Many features are 
insensitive to faults, and they are influenced by operating conditions. For example, 
shape factor and crest factor are established features used in rotating machinery fault 
detection [13]. In Figure 2, they were calculated to detect bearing faults. Rotation 
speed of the bearing was used as an operating condition. Figure 2 (Left) illustrates the 
insensitivity of features to a fault: under the same rotation speed, shape factor did not 
separate the data from the healthy bearing and faulty bearing, leading to false 
negative detection. Figure 2 (Right) illustrates the influence of operating conditions 
on features. Crest factor separated the data from a healthy bearing and those from a 
faulty bearing under 600 RPM. However, when the rotation speed was increased to 
 10 
 
 
1200 RPM, a large portion of the data from the healthy bearing overlaps with the 
faulty data at 600 RPM, leading to false positive detection.  
90%
75%
50%
25%
10%
Healthy Faulty Healthy Faulty Healthy
600 RPM 600 RPM 600 RPM 600 RPM 1200 RPM
 
Figure 2: Detecting Bearing Fault Using Shape Factor (Left) and Crest Factor (Right) 
 
Researchers have tried to increase the sensitivity of machinery anomaly detection by 
using multiple features together to perform multivariate analysis. Some researchers 
constructed features for specific failure modes and perform multivariate analysis based 
on them. Tian et al. [14] constructed 5 features, each of which is sensitive to a bearing 
failure mode. Some researchers used as many features as possible. Xia et al. [15] 
constructed 21 bearing fault features using signal processing techniques. Oh et al. [16] 
constructed 1000 bearing features using deep learning techniques. However, for a 
given failure mode, some features are insensitive to the fault. They behave as noise 
and have negative influence on the result. In Figure 3 (left), a benchmark dataset Iris 
Data was used to demonstrate the influence of noise features. Iris data have 4 features 
and 3 classes. Additional features consist of Gaussian noise were appended to the 
original data as noise features. Logistic regression was applied to perform 
classification. The performance of classification was evaluated by 5-fold cross-
 11 
 
Shape factor
Crest factor
 
validation, and the mean accuracy is used as a performance measure. When the 
number of noise features increases, the mean accuracy has a decreasing trend. 
Therefore, feature selection is necessary to select useful features that are capable of 
separating anomalies from normal data. Besides noise features, redundant features also 
have negative influence on the analysis. Redundant features are the features linearly 
correlated with each other. Some features are redundant with each other because they 
are measuring the same dynamics of the data. For example, the vibration signals 
collected by two accelerometers on the same surface of a gearbox chassis are 
redundant features. Peak-to-peak value and rms from vibration signals are likely 
correlated with each other. In Figure 3 (right), redundant features were appended to the 
Iris data. The redundant features are linear transforms of the original Iris data features 
with random constant terms. As the number of redundant features increases, the 
accuracy decreases. 
 
In multivariate analysis, variables are sometimes termed as attributes, dimensions, or 
features. One observation of variables is also called a point or an object. In some cases, 
sensor signals or other forms of raw data are directly used in anomaly detection. To be 
consistent with academic and industrial conventions, the selection of variables is 
called feature selection, no matter the variables are attributes of raw data or features. 
 12 
 
 
Example: including noise Example: including redundant 
features in Iris data features in Iris data
Number of noise features Number of redundant features
 
Figure 3: The Influence of Noise and Redundant Features 
 
2.2 Survey of Feature Selection Methods 
Available feature selection methods can be categorized as filter methods, wrapper 
methods, embedded methods, and ensemble methods. Filter methods select features 
based on the individual features’ properties towards the objective of the specified 
machine learning task. For example, in binary classification, individual features are 
evaluated that any feature provides a certain degree of separation of the data 
independently is selected. One approach of filter methods is hypothesis testing. The 
null hypothesis is, the data from two classes are sampled from the same distribution 
using a specific feature. If the null hypothesis is rejected, the data from the two classes 
are not regarded as from the same distribution, and it means the feature is able to 
separate the two classes. The hypothesis testing methods used in feature selection 
include t-distribution, F-distribution, and KS-distribution. Another approach selects 
features depending on the similarity between individual features and the response. 
Similarity measures used in feature selection include Pearson’s correlation coefficient 
 13 
 
Logistic regression accuracy
Logistic regression accuracy
 
[17] and mutual information [18]. The features selected by filter methods can be 
highly correlated and thus some selected features are redundant [17]. Therefore, a 
greedy search algorithm was developed in [19] to compensate this drawback. An 
objective function was used that only the feature maximizes the mutual information 
between the feature and the response while the mutual information between the 
selected feature and the subset of the so far selected features is a minimum. This type 
of optimization problem was summarized as max-relevancy, min-redundancy (mRMR) 
[20], and different optimization methods have been developed to solve the problem 
[21]. Filter methods select features based on the performance of individual features 
and thus they ignore the joint effect of data separation of some features. For example, 
when the data are only separable by a nonlinear function, all the variables in the 
nonlinear function should be selected as useful features but filter methods do not 
realize their relationships because they just evaluate one feature at a time.  
 
Wrapper methods use a search procedure such as forward selection and backward 
elimination to search the optimal subset. Forward selection starts with an empty subset 
and it repeatedly includes one feature at a time into a subset that optimizes an 
objective function [22]. The procedure stops when a threshold on the objective 
function is reached. Backward elimination starts with all features and repeatedly 
removes features one at a time until a criterion is satisfied. In a complete search, the 
approach takes O(n2) calls of the machine learning algorithm, and thus it is impractical 
in computation. Therefore, heuristic search algorithms have been implemented. For 
example, simulated annealing [23], genetic algorithm (GA) [24] and particle swarm 
 14 
 
 
optimization (PSO) [25] have been applied to search the optimized subset of features. 
These methods are not guaranteed to converge to the global optimum given finite 
iterations. Therefore, they are often simplified that a predefined order of selection is 
used to reduce the number of combinations for the feature subset. Representative 
method is recursive feature elimination (RFE), such as RFE support vector machine 
(RFE-SVM) [26][27].However, since a predefined order of selection is used, some 
combinations of features are not evaluated and the optimal subset of features may not 
be found. 
 
Embedded approach incorporates feature selection as part of the training process of a 
machine learning algorithm. The idea is to rank the features according to their 
weights or importance assigned by the algorithm during training. For example, the 
hyperplane of a linear support vector machine (SVM) is the optimized linear model 
that maximizes the separation of the data from different classes. The features with 
larger absolute values of weight are the ones contribute more to the separating 
hyperplane, and they are the useful features. Representative methods include neural 
network pruning [28] and decision tree-based feature selection[29]. For example, 
Krishnakumari et al. [30] used decision tree to select features from a group of 
vibration features and then applied a fuzzy classifier to diagnose spur gear fault. 
Some embedded methods utilize regularization to penalize the size of feature set, 
such as Lasso regression using L1 regularization, Ridge regression using L2 
regularization, and elastic net using both L1 and L2 regularization [31]. When some 
features are correlated, regularization-based methods suffer from the inconsistency of 
 15 
 
 
selection, which is the randomly selection of features from correlated features. 
 
Ensemble features selection approach was developed to aggregate the power of 
different feature selection procedures. In [32], SVM is used as the feature selection 
algorithm, and it is run on multiple bootstrap samples to generate an ensemble of 
feature sets and a subset of features is selected from the ensemble. To reduce the 
influence of the inconsistency of regularization-based methods, stability selection 
using the idea of ensemble selection was developed, such as randomized Lasso for 
regression and randomized logistic regression for classification [33]. By aggregating 
the feature selection power of decision trees, random forest has been implemented in 
feature selection [34]. In [35], multiple feature selection algorithms from filter, 
embedded and wrapped approaches were combined. Ensemble approach reduces the 
influence of correlated features but still suffer from instability. For example, in 
random forest, correlated variables are used interchangeably in the trees. As a result, 
the less relevant variables often replace the useful ones as selected features [36]. In 
randomized lasso and randomized logistic regression, when the size of a group of 
correlated features increases, the weights of the features in the group decreases, 
leading to incorrect model interpretation and misleading feature ranking [37]. Feature 
selection based on feature clustering [38] has been used for removing the correlation 
bias. However, this method only considers linear model since it depends on Lasso 
regression. Moreover, correlated features were averaged to make new features that the 
physical meaning of original features is lost. In [39], k-means clustering was improved 
to automatically estimate the feature groups. Because the clustering is based on an 
 16 
 
 
Euclidean distance-type dissimilarity, the challenge from the correlation of features is 
not addressed. For features extracted by principal component analysis (PCA) and 
manifold learning techniques, correlation is not an issue because the extracted features 
are orthogonal to each other. However, similar to the shortcoming of [38], the features 
lose their physical meaning. Keeping the physical meaning of the original features is a 
desired property of feature selection methods because feature selection is often applied 
in choosing optimal sensor set and interpreting failure physics. These tasks cannot be 
completed without features with physical meanings. 
 
2.2 Issues Related to Anomaly Detection Methods 
When features capable of separating anomalies from normal data are extracted and 
selected, they are still under the influence of operating conditions. In an experimental 
study, an accelerometer was mounted on the housing of a healthy bearing to collect 
vibration signals. The rotation speed was used as an operating condition and it was 
changed several times, as shown in Figure 4 (Left). The vibration signal was affected, 
and it changed with the rotation speed, as in Figure 4 (Right). Without knowing the 
rotation speed, the increase of amplitude and frequency in the vibration signal can be 
mistakenly regarded as anomalies. Therefore, some researchers used the signals of the 
operating conditions to normalize the signals used in health monitoring. In the case of 
vibration analysis, Prof. R. B. Randall at the University of New South Wales 
advocates the use of order tracking [40], and it has been applied to normalize the 
frequency of the vibration signals using the rotation speed in wind turbine bearing 
fault diagnosis [41]. More generally, some researchers develop different healthy data 
 17 
 
 
models for different states of operating conditions, such as multi-regime modeling 
PHM from the team of Prof. J. Lee at the University of Cincinnati [42]. Using the 
similar idea, Sammaknejad et al. [43] modeled observations around different process 
operating modes by different multivariate Student's t-distributions to describe 
different likelihoods of anomalies. Above methods require monitoring of operating 
conditions and thus they do not work when the operating conditions are indeterminate. 
Therefore, anomaly detection methods without the information of operating 
conditions are needed. 
35
0.1
30
25 0.05
20 0
15
-0.05
10
-0.1
5
0 2 4 6 8 10 12 14 16 18 0 2 4 6 8 10 12 14 16 18
Time (s) Time (s)
 
Figure 4: Influence of Rotation Speed (Left) on Vibration Signal (Right) 
 
In the bearing example of Figure 4, a sliding window was applied to the vibration 
signal, and for each window an observation of features was calculated. In the example, 
rms (Dim 1) and Kurtosis (Dim 2) were calculated. These features span a feature space 
and certain patterns in the feature space can be observed, as shown in Figure 5. 
Therefore, machine learning can be used to detect anomalies without knowing 
operating conditions by learning these patterns and differentiate them from those of 
anomalies. 
 18 
 
Rotation Speed (Hz)
Amplitude (g)
 
5.5
5
4.5
4
3.5
3
2.5
0.005 0.01 0.015 0.02 0.025
Dim 1  
Figure 5: Patterns in the Feature Space 
 
2.4 Survey of Anomaly Detection Methods 
Anomaly detection methods using machine learning can be categorized as supervised 
anomaly detection [44], unsupervised anomaly detection [45][46][47], and semi-
supervised anomaly detection [48][49]. Different methods have different requirements 
on the data and different application ranges.  
2.4.1 Survey of Supervised Anomaly Detection 
Supervised anomaly detection is based on statistical classification, which identifies to 
which of a set of classes a test observation belongs by learning the training data of 
these classes. Besides detecting the occurrence of the anomaly, supervised methods 
can also identify some properties of the anomaly, such as the type or location of the 
anomaly. Therefore, supervised methods are usually used for both the detection and 
diagnosis of anomalies. To implement supervised anomaly detection, training data for 
both healthy data and anomalies are required to be labeled. Classification techniques 
are used to set up decision rule from learning the labeled healthy data and anomalies to 
 19 
 
Dim 2
 
determine if testing data are anomalies. The anomaly detection task is a binary 
classification problem. Multiple classification is implemented to determine the type, 
location, and severity of anomalies if corresponding labeled training data are available. 
Supervised methods used in machinery anomaly detection include support vector 
machine (SVM), K-nearest neighbor (KNN), artificial neural network (ANN), linear 
discriminant analysis (LDA), naïve Bayes classifier, hidden Markov model (HMM), 
logistic regression, decision tree, random forest, and hybrid ensemble learning 
classification algorithms. 
 
In support vector machine (SVM), the data space is partitioned into normal regions 
and abnormal regions by hyperplanes [50]. If testing data fall into the abnormal region, 
they are detected as anomalies. Du et al.[51] constructed features using wavelet 
analysis and in the feature space bearing faults from different locations and severities 
were classified by SVM.As an improvement of SVM, Improved Support Vector 
Machine-based Binary Tree (ISVM-BT) was applied in [52][53] to classify different 
fault categories. ISVM-BT takes the advantage of both the efficient computation of the 
tree architecture and the high accuracy of SVM. However, users need to define the 
optimal hierarchy of the ISVM-BT, which lacks available rules. SVM methods work 
on both linear and nonlinear data, and they are robust against outliers. Their results are 
sensitive to the selection of kernel functions and parameters, but there is no efficient 
method for the selection. Because SVM is memory intensive and the search of the 
parameters requires multiple iterations of training and testing, SVM methods are not 
suitable for large datasets. 
 20 
 
 
 
K-nearest neighbor (KNN) classifier compares the distance from a testing data 
observation to its nearest neighbors in the healthy training data and the anomalies 
training data. If the distance to the neighbors of anomalies training data is closer, it is 
classified as an anomaly. [54] extracted bearing features using principal component 
analysis (PCA), and then classifies the bearing health states using KNN. KNN is a 
nonparametric method that it does not pose any assumptions on the data and therefore 
it has a wide range of applications. However, KNN is sensitive to outliers. If irrelevant 
or redundant features are included, KNN may also generate undesired result. 
 
Artificial neural network (ANN) simulate the structure of brain that multiple layers of 
neurons are interconnected to establish the relationship between the input and output. 
In the bearing fault detection and diagnosis, Ali et al. [55] used empirical mode 
decomposition (EMD) to extract features from vibration signals, and then used back 
propagation neural networks (BPNN) with two hidden layers to classify the anomalies. 
With the development and successful applications of deep learning techniques based 
on ANN, researchers began to explore the opportunity of implementing deep learning 
to machinery anomaly detection. Gan et al. [56] presented a deep learning application 
architecture for fault detection and fault severity classification using features 
constructed from the coefficients of wavelet packet decomposition. They developed a 
Hierarchical Diagnosis Network (HDN) based on Deep Belief Networks (DBN) that 
are constructed from multiple layers of Restricted Boltzmann Machines (RBM). The 
HDN consists of two layers with one for fault identification and another for severity 
 21 
 
 
classification. ANN methods can provide the best classification accuracy among 
available classification algorithms. However, they need large amount of training data, 
and they are sensitive to outliers. 
 
Discriminant analysis uses a discriminating function to perform classification. The 
optimal coefficients of the discriminating function are estimated that that the distance 
between classes is maximized, and the distance within the classes is minimized. The 
most widely used discriminating function is linear, leading to linear discriminant 
analysis (LDA). LDA assumes the data follow Gaussian distribution, but actual 
machinery data are usually non-Gaussian. Therefore, Jin et al. [57] uses trace ratio 
linear discriminant analysis (TR-LDA), which is a variant of LDA without the same 
normality requirement of the original LDA, to diagnose the bearing fault.  
 
Bayesian network is a probabilistic graphical model that represents a set of variables 
and their conditional dependencies via a directed acyclic graph (DAG). Observations 
are classified according to their probabilities received from the network. A widely 
used type of Bayesian network is naïve Bayes classifier, which assumes the features 
are independent. Because the independence assumption is difficult to meet, Zhang et al. 
[58] used decision tree and selective SVM to select features with low correlations, and 
then applied naïve Bayesian classifier to perform fault diagnosis. Hidden Markov 
model (HMM) is another type of Bayesian network. It models the training data with 
Markov process with one hidden state. In [59] shift-invariant dictionary learning was 
used to extract features for bearing signals, and then HMM was applied to detect and 
 22 
 
 
diagnose bearing faults. Because Bayesian network methods set strict assumptions on 
the distributions and the dependencies of the data, which are rarely met, their 
classification accuracy usually fail to match with other algorithms if they are properly 
tuned. 
 
Logistic regression trains a logistic function with the labeled training data by 
optimizing a cost function, such as the negative log-likelihood of the true labels given 
the predictions. During testing, the membership probabilities of test observations are 
calculated. Authors in [60] used logistic regression to classify engine health. In [61], 
logistic regression was applied to the features extracted from wavelet packet 
decomposition for bearing fault diagnosis. Compared with algorithms such as ANN 
and SVM, the result from logistic regression has better probabilistic interpretation, and 
its training process is efficient. However, if the decision boundary is nonlinear, its 
classification performance fails to be compared with ANN and SVM. 
 
Decision trees classify observations by sorting them based on feature values. Each 
node in a decision tree represents a feature in an observation to be classified, and each 
branch represents a value that the node can assume. Observation are classified starting 
at the root node and sorted based on their feature values. Because decision tree is 
working based on setting rules on features, its process facilitate intuitive understanding 
in engineering. In [62], decision tree was set up to detect machinery faults consist of 
mass imbalance, gear fault, and belt fault. Because classification algorithms such as 
 23 
 
 
decision trees, ANN and KNN are sensitive to outliers, they were often used in an 
ensemble to make a consensus decision, giving rise to ensemble learning. 
 
Ensemble learning combines the results of multiple classifiers to give a consensus 
decision. Because different classifiers have different sensitivity to outliers, ensemble 
learning is robust against overfitting. Random forest is an ensemble learning method 
that aggregates the estimation of a diverse set of decision trees To detect and diagnose 
bearing fault, Wang et al. [63] extracted features using wavelet packet transform, and 
then applied random forest in the feature space to perform classification. In the fault 
detection and diagnosis of spur gear in [64], an initial set of features from vibration 
signals were extracted from time domain, frequency domain, and wavelet transform. 
The classification performance of the subsets of these features were evaluated by a 
random forest classifier. The optimal subset was identified using genetic algorithm. 
Based on the selected subset of features, bearing health states were classified using 
random forest. Because diversity among classifiers is desired in ensemble learning, 
hybrid classifiers with an ensemble of classifiers induced from different classification 
algorithms were developed. To detect gear fault, Lei et al. [65] extracted features using 
envelope analysis, wavelet packet transform, and empirical decomposition, and then 
used a hybrid classifier consists of multi-layer perceptron (MLP) neural network, 
radial basis function (RBF) neural network, and KNN to perform the detection. Tian et 
al. [66] extracted commonly used features for rotating machinery from raw signals to 
form the base sample, and then bootstrap samples were generated. Each bootstrap 
sample was used to train a set of different classification algorithms to obtain a group 
 24 
 
 
trained classifier. The final decision was obtained by majority voting from the groups 
of trained classifiers from all the bootstrap samples. Because this hybrid classifier has 
two sources of diversity from bootstrapping and different algorithms, it is less likely to 
provide estimation with high variance. Ensemble learning methods can provide the 
best classification accuracy, they are robust against outliers, and they do not set 
restrictions to the statistical properties of the data. However, there is no explicit rule to 
determine the setup of the ensemble, and the training process is usually not as efficient 
as other algorithms. 
 
The application of supervised anomaly detection in machinery anomaly detection is 
limited by the requirement on the data. First, the requirement on the labeled faulty data 
as anomalous training data is difficult to meet. The training data from faulty systems 
are usually unavailable. For example, rotating machinery are critical components in 
safety-critical systems such as airplanes, and faulty data from these systems are 
unavailable. Some researchers have tried to fill this gap by simulating faulty data, as in 
the study of Tian et al. [14], where faulty bearing data are simulated using a faulty 
bearing signal generating model. In general, to simulate faulty data of a component, 
data generating models based on the physics of failure of the component are required. 
However, those models are often unavailable due to lack of modeling or inadequate 
understanding of the physics of failure. Second, even if faulty training data are 
collected in some cases, the sample size is not comparable to the data from the healthy 
systems because the faulty component either develops to failure or is processed by 
maintenance when it is noticed. Therefore, the training data from the healthy class and 
 25 
 
 
the faulty class are not balanced. Although techniques such as upsampling of the faulty 
training data or downsampling of the healthy training data can alleviate the biased 
classification of the unbalanced data, the faulty training data only represent a portion 
of the faulty population so that the classification boundaries are still biased towards the 
healthy training data. 
 
2.4.2 Survey of Unsupervised Anomaly Detection 
Unsupervised anomaly detection avoids the requirement on the labeled data. In 
unsupervised anomaly detection, anomalies are detected using unlabeled data based on 
pre-defined assumptions that differentiate normal data and anomalies. A review of 
unsupervised anomaly detection is provided in [67]. Unsupervised anomaly detection 
starts with partitioning health monitoring data into clusters from the assumption that 
the data from healthy systems and the data from faulty systems are generated by 
different mechanisms and they form different clusters, as in the example in [68] that 
fuzzy c-means clustering identified multiple clusters in the feature space when the 
same number of bearing health states exist. In [69], noise Clustering and Density 
Oriented Fuzzy c-Means algorithms were used to eliminate outliers, and then kernel c-
means algorithm with optimized parameters was applied to maximize the separation of 
clusters, which were assumed to be from different health states. An improved artificial 
ant clustering technique was applied in [70] to automatically group data consist of 
observations from healthy motor, motor with broken bar, and motor with faulty 
bearing. In [71], wavelet packet transform (WPT) and ensemble empirical mode 
decomposition (EEMD) were applied to extract features from bearing vibration signals, 
 26 
 
 
and then an adaptive feature selection technique was developed to remove redundant 
features. Using these features, affinity propagation clustering was applied to partition 
the data into clusters, which are assumed to correspond to different bearing health 
states. Using vibration signals, Li et al. [72] extracted features using empirical wavelet 
transform and autoregressive model. After dimensionality reduction using locality 
preserving projection, fuzzy c-means was applied to cluster the data in the feature 
space. Observations from different health states were found to concentrate around 
certain centroids.  
 
Above papers did not talk how to determine if certain clusters correspond to fault 
conditions to complete anomaly detection. To identify the clusters of anomalies, the 
properties of the clusters need to be evaluated based on pre-defined assumptions. One 
assumption is based on the densities of clusters. For a given observation, the radius of 
a hyper-sphere centered at the observation, which contains a defined number of other 
observations, is an estimate of the inverse of the density in the neighborhood of the 
observation. The mean value of the densities for all the observations in a cluster is 
used to represent the density of the cluster. Healthy systems are working in 
equilibrium states, and thus the health monitoring data form clusters with high 
densities. When the systems become faulty, they do not stay in the equilibrium states 
and thus the data are scattered more widely, forming clusters with lower densities. 
These clusters of different densities can be identified using density-based clustering. 
Tian et al. [73] applied a density-based clustering algorithm to partition bearing health 
monitoring data, and used the rate of change of the density as a threshold to detect 
 27 
 
 
anomalies. A challenge to this approach is, under different operating conditions, the 
cluster densities of a healthy system can be different. For example, under higher 
rotation speed and higher load, the vibration of a bearing has higher variance that the 
features extracted from the vibration signal form clusters with lower densities. This 
kind of lower density clusters can be confused with anomaly clusters. A third 
assumption assumes the data from a healthy system and a faulty system are sampled 
from different statistical distributions. For example, if the data in the clusters from the 
healthy system are Gaussian, the non-Gaussian clusters consist of anomalies. There are 
some difficulties to implement this approach because machinery health monitoring 
data are usually high dimensional. First, the evaluation of the high-dimensional 
multivariate distributions, such as multivariate hypothesis testing, is difficult. Second, 
the data in a given cluster may not follow any known multivariate distribution. A 
widely used assumption is based on the sizes of clusters. The data from healthy 
systems are usually abundant, and anomalies are scarce. Because they are generated by 
different data generation mechanisms, they form different clusters, and the clusters of 
anomalies have smaller sizes. Clusters with sizes smaller than a certain threshold are 
regarded as consist of anomalies. This approach can be unreliable because smaller 
clusters can be generated by situations other than anomalies: if the system is working 
briefly in a state of operating conditions, a small cluster is formed even if the system is 
healthy. One assumption assumes anomalies are distributed in a manner that they do 
not form clusters by themselves. Therefore, only the data from healthy systems form 
clusters and the observations do not form any cluster are regarded as anomalies. A 
difficult to implement this approach is from the challenge of setting up threshold to 
 28 
 
 
determine the cluster membership of observations. Moreover, it is common that 
anomalies form clusters by themselves, and thus violating the assumption.  
 
Some researchers borrow the knowledge from the faulty data generating mechanisms 
in PoF for the identification of faulty clusters. K-means clustering was used to 
partition bearing data according to the bearing's health states in [74]. Initial centers of 
the k-means clustering are produced by simulated data. The initial cluster centers for 
the faulty bearing data were simulated using known faulty bearing data generating 
models, and the initial cluster centers for the healthy bearing data were simulated as 
white noise. The observations in the clusters developed from the initial centers of the 
simulated faulty data are regarded as faulty. Compared with simulating training data in 
supervised anomaly detection, this approach does not have strict requirement on the 
accuracy of the data generating models, because the initial cluster centers are only 
required to be close to the final centers, and k-means algorithm updates its value 
during iterations. However, after the update, the cluster center values may have 
significant changes that the initial cluster center, which represent certain health state, 
may not represent the final cluster, and thus false detection occurs. Moreover, the data 
generating model for many failure modes are not available. Inacio et al. [75] 
developed a recursive clustering approach for machinery fault diagnosis. At the 
beginning, only healthy data are available, and after the clustering, a rule was set up to 
identify the healthy cluster. When faulty occurred, the rule was updated that the faulty 
clusters were manually identified. Although training data were not formally labeled, 
knowledge from the actual machine condition was required to identify faulty clusters. 
 29 
 
 
 
2.4.3 Survey of Semi-Supervised Anomaly Detection 
The data from healthy systems are abundant, facilitating the application of semi-
supervised anomaly detection, which uses labeled normal data to define the normal 
behavior and identifies observations deviating from the defined normal behavior as 
anomalies [76]. Statistical anomaly detection techniques assume that anomalies occur 
in the low probability regions of a stochastic model of the healthy data. These 
techniques rely on the assumption that the data follow certain distributions. However, 
real data may not follow these distributions. Representative work includes 
nonparametric statistical analysis [8]. Nearest neighbor-based anomaly detection 
techniques assume that anomalies occur far from the nearest neighbors in the healthy 
reference data. A representative method is k-nearest neighbor (KNN) [7]. Outliers in 
the healthy reference data may lead to false negative errors since they can be regarded 
as close neighbors by an anomaly. Also, these techniques do not consider the influence 
of the distribution of the data on anomaly detection. 
 
Tian et al. [14] used KNN in a semi-supervised manner for motor bearing anomaly 
detection. At first, bearing fault features are extracted using spectral kurtosis, cross 
correlation, and principal component analysis. Then, KNN is applied to calculate the 
distance between the test observation to its nearest neighbors in the training data in the 
feature space. The KNN distance is used as an anomaly indicator that a hypothesis 
testing is applied on it to perform anomaly detection. 
 30 
 
 
 
In [77], an one-class SVM model was trained by healthy bearing data, and during test 
the faulty bearing data were found outside of the model's decision boundaries. To 
reduce the influence of outliers in the healthy training data, one-class SVM was 
improved in [78] for the purpose of anomaly detection. In [79], recurrence time 
statistics of vibration signals were calculated as features to capture the information of 
incipient bearing fault, and one-class SVM was trained by healthy bearing data to set 
up healthy boundaries for anomaly detection. 
 
KNN and one-class SVM methods are sensitive to the choice of hyper-parameters, as 
shown in Figure 6. Choosing the hyper-parameters requires labeled anomalies, which 
are often unavailable. 
A general boundary tends to 
get false negative detection A detailed boundary tends to 
get false positive detection
Detection boundary
* *
Dim 1 Dim 1
 
Figure 6: A General Boundary and a Detailed Boundary 
 
Some researchers model the healthy data as a mixture of distributions, and anomalies 
are detected if test data are not generated by the mixture. In the bearing health 
assessment work of [80], features were extracted by locality preserving projections as 
 31 
 
Dim 2
Dim 2
 
an improvement of feature extraction by principal component analysis (PCA). In the 
feature space a Gaussian mixture model (GMM) was trained by healthy bearing data. 
Kernel density estimation [81] was applied to setup the anomaly threshold. The 
anomaly score of the test data is the exponentially weighted moving average of the 
negative log likelihood probability referring to the trained GMM. 
 
Because GMM is not application for many situations, other clustering techniques are 
used. Pan et al. [82] extracted bearing fault features using wavelet packet transform, 
and then applied fuzzy c-means to the healthy data in the feature space to identify the 
cluster centroid as healthy reference. The deviation from the reference is used as the 
degradation indicator. Although no faulty data were used in the calculation of the 
degradation indicator, the characterization of the values of the indicator still needs the 
centroid of the faulty data cluster. Huang et al. [83] developed a method based on self-
organizing maps (SOMs) that vibration features from healthy bearings are used to train 
SOMs and anomaly score of a test observation is obtained by calculating the minimum 
quantization error of the observation referring to the trained maps. The robustness of 
SOMs in machinery anomaly detection is further improved by Tian et al. [84] that the 
mean quantization error from the test observation referring to its nearest best matching 
units in the trained SOMs is calculated as the anomaly score. 
 
Some researchers combine multiple anomaly detection methods together to construct 
an ensemble. In [85], empirical mode decomposition and Hilbert-Huang Transform 
were applied to extract features from bearing vibration signals, and then a hybrid 
 32 
 
 
ensemble detector from the majority voting of Gaussian anomaly detector, nearest 
neighbor anomaly detector, and PCA anomaly detector was developed to detect 
anomalies in the feature space. 
 
2.5 Problem Statement and Objectives 
A typical scenario in PHM is, a number of sensors and features are prepared from 
different sources of information and feature selection is carried out at an early stage 
that fault sensitive sensors and features can be selected and installed for health 
monitoring. At this stage, experimental study can still be carried out to generate both 
the healthy and faulty data to guide the feature selection. In the health monitoring at a 
later stage, large amount of healthy data under a wide range of operating conditions 
are collected, and faulty data are usually unavailable. 
 
Under this scenario, insensitive and redundant features still remain an issue in feature 
selection. Selecting features without considering feature correlation structure results in 
overfitting from redundant features. Available methods either provide inconsistent 
result when features are correlated or are impractical in terms of computation. Even 
when appropriate features are used, available anomaly detection methods are 
unsatisfactory when operating conditions are indeterminate. Because faulty data are 
usually not available at the stage of anomaly detection and healthy data are abundant, 
semi-supervised learning is an appropriate approach. Modeling healthy data and 
 33 
 
 
setting up hyper-parameters of the model without faulty data is an open issue to be 
addressed. 
 
To address above research gaps, this dissertation has the following objectives: first, 
develop a method to select useful features when insensitive features exist and some 
features are correlated. This objective involves the identification of the correlation 
structure of the features, selecting representatives from correlated features, and 
selecting useful features from representative features. Second, develop a semi-
supervised anomaly detection method that works without the information from 
operating conditions. This objective involves investigation of the influence of 
operating conditions on machinery data, developing semi-supervised anomaly 
detection method by identifying the influence of operating conditions, and developing 
methods to automatically set up model parameters. The feature selection method and 
the anomaly detection method developed in the dissertation have different 
requirements on the data: the feature selection method is a supervised method that the 
training data of all health classes are needed. The anomaly detection method is a semi-
supervised method that only the training data from the healthy class are required. 
Typical application scenarios combining the use of the feature selection and anomaly 
detection methods are machinery health monitoring projects: at the beginning, sensors 
and their mounting locations are evaluated. An initial set of sensors and their locations 
is selected based on domain knowledge. To avoid false negative detection, this initial 
set of sensors tries to include a large number of sensors covering all possible 
conditions. Then, experiments are conducted to collect labeled data for all health 
 34 
 
 
classes of interest. After that, feature selection is performed to select uncorrelated 
useful sensors from the initial sensor set. Based on the correlation structure of the 
sensors identified by the feature selection method, redundant sensors for critical 
features are kept. These selected sensors are the ones most sensitive to the change of 
health conditions and they are installed to the machinery system as the final product. 
During the actual health monitoring, failure data are rare and they have a wider 
diversity than the failure data collected from experiments. Therefore, only healthy data 
are collected by the selected sensors as the training data to train the anomaly detection 
method of this dissertation, which is semi-supervised. 
 
Chapter 3: Feature Selection 
A feature selection method is developed in this dissertation to achieve two objectives: 
first, select useful features. The method should work even when both noise features 
and redundant features exist, and it should accept both linear and nonlinear data. 
Second, identify the correlation structure of the features to find useful redundant 
features. Identifying the useful redundant features is necessary because in some 
applications, such as the sensor systems of aircraft or nuclear plant, redundancy of data 
sources is required for safety concerns. 
3.1 Development of the Feature Selection Method 
The feature selection method searches the optimal subset of useful features using a 
procedure consists of three steps. In the first step, features are clustered according to 
their linear correlations. Correlated features are grouped into the same cluster and they 
 35 
 
 
are regarded as redundant with each other. If a feature is not correlated with other 
features, it is regarded as a cluster containing only itself. Second, from each cluster a 
representative feature is selected. Although the features within the same cluster are 
correlated, they are still different from each other that some features contain more 
useful information, and some contain more noise. The feature that is most capable of 
separating different classes is selected as the representative. Third, representative 
features from all the clusters are concatenated to form a new feature space where all 
the features are noncorrelated. A subset of features is selected by a feature selection 
algorithm from this preprocessed feature space. The first two steps identify redundant 
features and the third step removes noise features. Because redundant features are 
removed in the first two steps, feature selection algorithm will not suffer from the 
correlation. The procedure is illustrated in Figure 7. 
 
If feature clustering and representative selection are performed after useful feature 
selection, the whole procedure will have the same shortcoming as other feature 
selection methods in processing redundant features. For example, if random forest is 
used to select useful features as the first step, features with lower capability of 
separability within a group of highly correlated features have the same chance of being 
selected [36]. If one less useful feature is selected, the features highly correlated with it 
will be abandoned and these features may be more capable of separating different 
classes. After clustering and representative selection, the final feature set loses some of 
the most useful features. Therefore, feature clustering and representative selection 
should be performed before useful feature selection. 
 36 
 
 
Original features
Feature clustering
Noncorrelated 
Feature clusters
features
Representative selection
Representative 
features
Preprocessed
feature space
Feature selection
Subset of 
features
 
Figure 7: Flowchart of Selection 
 
To implement this procedure, several challenges need to be solved. First, what is the 
criterion to determine some features are highly correlated and they should be grouped 
into the same cluster? Second, how to determine if a feature is more capable of 
separating the classes than the rest within a cluster? Third, what feature selection 
algorithm should be used in the third step? 
 
To measure if two features are highly correlated, correlation distance is used. 
Correlation distance d(u,v) between feature u and feature v is defined as in Equation 
(1). 
 37 
 
 
cov(u,v)
 d(u,v) =1−  (1) 
 u v
 
where cov is the covariance operator, σu and σv are the standard deviation of feature u 
and v, respectively. 
 
The correlation distance is developed based on Pearson’s correlation coefficient, 
where 1 indicates perfect positive linear correlation and -1 indicates perfect negative 
linear correlation. Since both positive and negative linear correlations have the same 
influence on feature selection, the strength of linear correlation can be represented by 
the absolute value of Pearson’s correlation coefficient, where 1 indicates perfect linear 
correlation, and 0 indicates no linear correlation. Correlation distance measures how 
much two features’ relationship deviates from perfect correlation, where 0 indicates no 
deviation from perfect correlation, and 1 indicates a total deviation. Correlation 
distance is a dissimilarity measure that can be readily processed by available clustering 
methods. 
 
Using correlation distance as the dissimilarity metrics, correlated features are grouped 
into clusters. Compared with k-means clustering, density-based clustering [86], and 
self-organizing maps, agglomerative clustering provides a correlation tree to describe 
the structure of correlations between the features, which is desirable for intuitive 
understanding. The agglomerative clustering is used in this research in the following 
 38 
 
 
way: at the beginning of clustering, every feature is a cluster itself. These clusters form 
a set F. Two clusters are merged into one cluster based on the criterion of Equation (2). 
Then the two original clusters are removed from F, and the newly merged cluster is 
included in F. This procedure is performed iteratively until F contains only one cluster. 
 ming(a,b) : a,bF  (2) 
where g(a, b) is the distance between cluster a and b, as defined in Equation (3). 
 g(a,b) = mind (u,v) : ua,vb (3) 
The result is a correlation tree that the features in the clusters near the root are less 
correlated than the features at the furthest branches, as shown in Figure 8, where the 
cluster tree of the Iris data is plotted. To complete the clustering, a threshold on the 
correlation distance is needed to cut down the tree. In the Iris example, if the tree is cut 
below node 1, every feature is a cluster itself. If the tree is cut between node 1 and 
node 2, feature 2 and 3 form a cluster, and feature 0 and 1 are two separate clusters. If 
the tree is cut between node 2 and 3, feature 0, 2, and 3 form a single cluster and 
feature 1 is a cluster itself. If the tree is cut above node 3, all the features form a single 
cluster. 
 39 
 
 
Node 0
Node 1
Node 3 Node 2
Correlation distance
 
Figure 8: Correlation Tree of the Iris Data 
When the correlation tree is cut, one or more clusters are formed. Features in the same 
clusters are similar and a representative is selected. Selecting only one representative 
feature from a group of linearly correlated features does not have the same problem of 
filter methods of ignoring the joint effect of separation using multiple features for the 
following reasons: when the features are linearly correlated, they carry the same 
information that contributes to the separation of different classes and therefore 
including multiple linearly correlated features will not improve the result. However, 
perfect linear correlation rarely exists and some features may contain a nonlinear 
relationship that contributes to the separation of classes. Therefore, an optimal value to 
cut the tree that minimizes loss of joint information among features is needed.  
 
The representative is the feature that is most capable of separating different classes of 
data within a cluster. Decision tree is used to evaluate the separating capability 
because of its generosity on the types of data and its advantage of providing intuitive 
 40 
 
Feature Index
 
results. The data from individual features in the same cluster are evaluated by decision 
tree in turn using cross validation. The feature providing the highest mean accuracy is 
selected as the representative of the cluster. For clusters with only one feature, the only 
feature is used as the representative. After representative selection, the correlation 
among the features that restricts the performance of available feature selection 
methods is removed. 
 
To select useful features from the representative features, random forest with Gini 
impurity is used. For each tree in the random forest, the impurity decrease of each 
feature is calculated. The mean impurity decrease across all the trees in the forest is 
calculate for every feature, and the features are ranked according to their mean 
impurity decrease. Compared with feature selection methods based on linear models, 
random forest has a wider application range that both linear and nonlinear data can be 
processed. Since random forest is nonparametric, it does not require the data follow 
any parametric distributions. Being an ensemble learning method, random forest is 
robust against noise and is more likely to provide stable results.  
 
To determine the optimal value to cut down the correlation tree, the cluster should be 
defined specifically. Because in this research the purpose of clustering is to find the 
groups of features that behave similarly in classification, a cluster is defined as a group 
of features sharing the same piece of useful information that does not exist outside the 
group. If the threshold value is too high, a smaller number of clusters are formed, and 
each cluster has more members with higher dissimilarity. Except the information 
 41 
 
 
shared by the members of clusters, some members may specifically contain useful 
information that contributes to the classification. After representative selection, this 
useful information is lost, leading to reduced classification accuracy. If the threshold 
value is too low, a larger number of clusters are formed, and different clusters may 
share the same information. After representative selection, correlated features still 
exist, leading to unstable feature selection and reduced classification accuracy. In both 
scenarios, the definition of clusters of this research is violated. If an optimal threshold 
is selected, the loss of useful information is minimized because all the features in the 
same cluster share the same piece of useful information, and after representative 
selection, this information is kept. The number of redundant features is also minimized 
because the shared useful information does not exist outside the cluster. As a result, 
classification accuracy is maximized. The optimal threshold is determined by 
searching all the tree-cut nodes. In the example of the Iris data, the cluster tree is cut at 
node 1, node 2, and node 3, generating 3 subsets of features by implementing the 
procedure described in Figure 7. The tree-cut node value leads to the highest 
classification accuracy corresponds to the threshold, as shown in Equation (4). 
 dT = argmax A(d)  (4) 
dD
where dT is the threshold, d is the correlation distance, A is the model performance 
metrics, such as accuracy, area under curve, or a transform of generalization error. In 
this research the mean accuracy of cross validation is used. D is the set of correlation 
distance values corresponding to the nodes of correlation tree. 
 
 42 
 
 
In general, if there are m features, the correlation tree has m nodes, including the root 
node. The procedure in Figure 7 is evaluated m times to get the optimal value. The 
dimensionality of most systems is in the order of less than 103, and an exhaustive 
searching strategy is practical. In cases of large dataset with higher dimensions, 
heuristic optimization algorithms such as simulated annealing [87] can be applied for 
the search. The searching strategy is shown in Figure 9. Moreover, since the searches 
are independent from each other, parallel computing is implemented, as shown in the 
dashed box of Figure 9. 
Original features
Feature clustering
Node values
Tree-cut … Tree-cut threshold 1 threshold m
Feature candidate Feature candidate 
subset selection subset selection
Feature Feature 
subset 1 subset m
…
Cross validation Cross validation
Select the feature subset giving 
the best cross validation result
Optimal feature 
subset
 
Figure 9: Optimization of the Feature Selection 
 
 43 
 
 
At first, a correlation tree is obtained by agglomerative feature clustering using 
correlation distance as dissimilarity measure. From the correlation tree the correlation 
distance values at the nodes are obtained. These values of the nodes are used as 
thresholds to cut down the treetop form clusters. For each tree-cut threshold, a feature 
subset is selected using the procedure of Figure 7. Each feature subset is evaluated by 
cross-validation to get its performance in classification. The feature subset giving the 
best cross-validation result is the final selection, and the corresponding threshold value 
is the optimal value for cluster tree-cut. The classifier used in the cross validation can 
be determined by the specific problem. In this research, decision tree is used due to its 
wide application range and intuitive result, which is needed in feature selection. 
Because the whole procedure is based on the searching of the optimal tree cut, the 
method is named correlation tree-cut (CTC) feature selection. This procedure also 
helps to define the concept of correlated features. Measuring by the absolute value of 
Pearson’s correlation coefficient, any pair of features are correlated with a value 
between 0 to 1. 0 indicates no correlation and 1 indicates perfect correlated. In real life 
0 and 1 of the coefficient values are ideal and they are unlikely to occur. Since any 
value above 0 indicates a degree of correlation, to determine if a value is significant, 
researchers have selected fixed thresholds, which is a subjective approach. In this 
research, a flexible threshold is used. Different thresholds lead to different groups of 
features. The threshold that leads to the selection of the optimal feature subset, which 
provides the highest classification accuracy, is selected, and the features with 
correlation coefficients above the threshold are regarded as correlated with each other. 
The true-cut method is choosing a global threshold to cut down the tree to form 
 44 
 
 
clusters. Using localized thresholds for tree-cut may have possibility of utilizing more 
detailed information of the correlation tree to generate clusters leading to higher 
classification accuracy. This will lead to the development of another feature clustering 
algorithm and can be studied as a future research topic. 
 
3.2 Evaluation of the Feature Selection Method 
The correlation tree-cut (CTC) feature selection was compared with widely used 
feature selection methods including random forest (RF) with Gini impurity, stability 
selection (SS) using randomized logistic regression with L1 penalty, recursive feature 
elimination (RFE) using decision tree, support vector machine (SVM), and logistic 
regression (LR) with L1 penalty. For each method, the selected subset of features was 
evaluated by k-fold cross validation. The mean accuracy and the number of selected 
features were used as performance metrics. Higher mean accuracy and smaller number 
of selected features indicate better performance of a method. In the k-fold cross 
validation, decision tree was used to induce classifiers because of its capability of 
classifying both linearly separable and non-separable data and its insensitivity to the 
setup of hyperparameters. Moreover, unlike random forest classifier or hybrid 
classifier [88], decision tree’s own feature selection effect is less likely to mask the 
performance of the feature selection methods under evaluation is set to 5 because 5 
values from the testing results are usually regarded as statistically large enough for the 
calculation of mean and standard deviation (std), and are small enough to give 
adequate observations for the testing fold. Datasets used in the evaluation consist of 
simulated data, benchmark data and experimental data.  
 45 
 
 
 
3.2.1 Feature Selection Method Evaluation Using Simulated Data 
Simulated data, experimental data, and field data were applied to evaluate the method 
to avoid the possibility that any conclusion is made only for specific situations. 
Simulated data were used because the ground truth of useful features is known. Spiral 
data with two classes were simulated because they are challenging cases of nonlinear 
data, and thus the performance of different methods can be distinguished more easily. 
The data have 12 features and 400 observations. The features include 2 useful features 
that separate the classes, as shown in Figure 10. Additional features include 6 features 
of Gaussian noise, 2 features correlated with the 2 useful features, and 2 features 
correlated with 2 noise features. 
Class 1 Class 2
Feature 0
 
Figure 10: Useful Features of the Simulated Data 
 
 46 
 
Feature 1
 
The correlation tree from CTC is shown in Figure 11. It correctly identified feature 0 
and feature 8, feature 1 and feature 9, feature 2 and feature 10, feature 3 and feature 11 
are mutually redundant features due to their linear correlations. 
Tree cut threshold
Correlation distance
 
Figure 11: Correlation Tree of the Simulated Data 
 
CTC automatically selected feature 1 and feature 2 as the useful features. The result is 
consistent with the ground truth. A comparison of the results from all the selected 
methods is shown in Table 1. Compared with other methods, CTC is the only method 
correctly identified the useful features. RF, SS, SVM, and LR mistakenly include 
noise features and some redundant features as useful features. Although RFE selected 
two features, one feature is a redundant feature of an actual useful feature. As a result, 
CTC has higher mean accuracy and lower std accuracy from the cross validation. 
 
 47 
 
 
Table 2: Feature Selection Results of the Simulated Data 
Method CTC RF SS RFE SVM LR
# Features 2 3 9 2 5 8
Mean Accuracy 88.3% 87.7% 86.2% 87.5% 86.2% 86.5%
std Accuracy 0.02 0.033 0.048 0.035 0.028 0.025  
 
The performance of the methods was further evaluated by increasing the number of 
noise features and redundant features, as shown in Figure 12 and Figure 13.In addition 
to the 2 useful features, 2m noise features were added to the feature set. Subsequently, 
m redundant features were added to the feature set, some of which were linearly 
correlated with the useful features and some of which were linearly correlated with the 
noise features. To generate a redundant feature, a feature is randomly selected from the 
feature set, which consists of both useful features and noise features, and the selected 
feature is linearly transformed with a randomly generated constant term. As the m 
increases, the number of features selected by linear models such as SS, SVM, and LR 
increases. This is because the data are not linearly separable, and linear models failed 
to find a subset of features to maximize the classification accuracy. Instead, most 
features were assigned the same weight or importance. RF and REF using decision 
tree are capable of processing nonlinear data and therefore they selected a smaller 
number of features and achieved higher values of mean accuracy than SS, SVM, and 
LR. However, RF and REF randomly select features from correlated features. As a 
result, their performances fluctuated. CTC consistently selected the smallest number of 
features with the highest mean accuracy. An observation is, when the number of noise 
and redundant features is 90 and 120, besides two useful features CTC selected an 
additional noise feature and the mean accuracy was increased as a result. This is 
 48 
 
 
because a randomly generated noise feature may form a separable pattern by chance 
and thus it contributes to the classification. 
 
 
SS
SVM
LR
RFE RF
CTC
Number of noise and redundant features
 
Figure 12: The Number of Selected Features as the Number of Noise and Redundant 
Features Increases 
CTC
RFE
RF
SVM
SS
LR
Number of noise and redundant features
 
Figure 13: Mean Accuracy as the Number of Noise and Redundant Features Increases 
 
 49 
 
Mean accuracy Number of selected features
 
3.2.2 Feature Selection Method Evaluation Using Benchmark Data 
Wine dataset [89] was used as the benchmark data. This dataset has 3 classes, 13 
features, and 178 observations. The correlation tree from CTC is shown in Figure 14. 
CTC regards feature 5, feature 6, and feature 11 as mutually redundant features. It 
identified feature 6, feature 9, and feature 12 as useful features. The results are shown 
in Table 2. Compared with other methods, the number of features selected by CTC is 
the smallest, and the mean accuracy is also the highest. However, its std accuracy is 
not the smallest. 
Tree cut threshold
Correlation distance
 
Figure 14: Correlation Tree of the Wine Data 
 
Table 3: Results of the Wine Data 
Method CTC RF SS RFE SVM LR
# Features 3 7 8 9 5 7
Mean Accuracy 97.2% 95.5% 95.0% 93.2% 93.2% 94.9%
std Accuracy 0.031 0.014 0.048 0.028 0.056 0.038  
 50 
 
Feature Index
 
 
Similar to the evaluation of the simulated data, 2m noise features and m redundant 
features linearly correlated with the useful and noise features were added to the 
original 13 features. The results are shown in Figure 15 and Figure 16. 
RF
LR
SVM
RFE
SS
CTC
Number of noise and redundant features
 
Figure 15: The Number of Selected Features as the Number of Noise and Redundant 
Features Increases 
RF SS CTC
RFE
LR
SVM
Number of noise and redundant features
 
Figure 16: Mean Accuracy as the Number of Noise and Redundant Features Increases 
 
 51 
 
Mean accuracy
Number of selected features
 
CTC consistently selected the smallest number of features while maintaining the 
highest mean accuracy: independent of the choice of the number of noise and 
redundant features, CTC constantly selected the smallest number of features and it has 
the smallest fluctuation compared with other methods. When the number of noise and 
redundant features increased, the mean accuracy of CTC did not drop as other methods. 
In sum, compared with benchmark methods, CTC provided more accurate feature 
selection results and was influenced the least by noise and redundant features. 
3.2.3 Feature Selection Method Evaluation Using Experimental Data 
An application of feature selection is to identify useful features from a large number of 
features in fault diagnosis. Therefore, CTC was evaluated by an experimental study of 
fault diagnosis using a machinery fault simulator, as shown in Figure 17. 
Motor Shaft Accelerometer Bearing
Gearbox Belt drive
 
Figure 17: Experiment Setup 
 
 52 
 
 
The experiment was step for fault diagnosis of rolling element bearings. The bearings 
under test include a healthy bearing, a bearing with outer race fault, a bearing with 
inner race fault, and a bearing with ball fault. The data generated were labeled 
accordingly. A gearbox was installed as a noise source, and the motor driving the 
bearing was run at different rotation speeds. The setup is to simulate the scenario of 
conducting health monitoring under changing operating conditions and masking noise. 
 
An accelerometer was mounted on the housing of the bearing to collect vibration 
acceleration signals at a sampling rate of 25,600 Hz. The sampling rate was selected 
considering the resonance excited by a faulty bearing should be captured. A 
rectangular window with a length of 2 s (51,200 points) was applied to slide along the 
signals with a step of 1s (12,800 points). The window setup gives a frequency 
resolution of 0.5 Hz, and a time resolution of 1 s, which are appropriate for frequency 
domain analysis and in-situ monitoring. For each slide of the window, a vector of 
features was calculated from the portion of the signal inside the window. Widely 
accepted features in time domain, frequency domain, and time-frequency domain were 
calculated. This research used 11 time domain features, including peak-to-peak, rms, 
standard deviation, skewness, kurtosis, impulse factor, crest factor, the square root of 
the amplitude, margin factor, shape factor, and kurtosis factor. Math expressions of 
these features are described in [90]. Frequency domain feature used is the primary 
frequency of the enveloped signal. Time-frequency domain features used are the 
wavelet packet node energies of the first three levels, which generate 14 features. 
Together 26 features are used. The data in each dimension were normalized using Z-
 53 
 
 
score to avoid biased classification from the difference of scales among different 
features. There are 214 observations distributed in 4 classes of bearing health 
conditions. 
Tree cut threshold
Correlation distance
 
Figure 18: Correlation Tree of the Experimental Data 
 
The correlation tree from CTC is shown in Figure 18. CTC found the features 
extracted from the wavelet transforms (feature 12-25) are correlated. In addition, they 
are also correlated with peak-to-peak and rms of the signal in time domain. Among 
these correlated features, feature 25 has the highest capability of separating different 
bearing classes. Impulse factor (feature 5) and crest factor (feature 6) are correlated 
and the impulse factor was selected. 
 
 54 
 
Feature Index
 
The result is shown in Table 3. Compared with other methods, CTC selected the 
smallest number of features and achieved the highest mean accuracy and the smallest 
standard deviation. The two selected features are impulse factor (feature 5) of the 
signal in the time domain and the energy of wavelet packet (3, 7) (feature 25), which is 
the detail of the last node of the third level. This selection is consistent with the 
engineering analysis. When the bearing becomes faulty, rolling elements strike the 
fault as they run over it, generating impulses. When the signal becomes impulsive, the 
distribution of the signal has heavier tails that the kurtosis increases. The strike of the 
rolling elements on the fault excited the resonance of the mechanical structure. Due to 
the compact structure of the bearing, the resonance usually has a high frequency. As a 
result, the resonance information is picked up by wavelet packet node energy (WPNE) 
(3, 7), which represents the highest frequency band of the chosen wavelet packet 
transform. 
Table 4: Results of the Experimental Data 
Method CTC RF SS RFE SVM LR
# Features 2 12 25 13 20 12
Mean Accuracy 98.6% 97.2% 94.4% 95.8% 93.9% 96.3%
std Accuracy 0.019 0.028 0.024 0.023 0.044 0.024  
 
The two selected features are plotted in Figure 19. Both features were normalized by 
calculating their Z score. Because rotation speed was changed several times, the 
observations of every class form multiple clusters. The clusters from different classes 
are separable by decision tree classifier in the space spanned by the two selected 
features and thus a high mean accuracy is achieved. 
 55 
 
 
Outer race fault Ball fault
Inner race fault
Healthy bearing
Impact factor
 
Figure 19: Selected Features 
 
3.2.4 Field Study of the Feature Selection Method 
CTC was implemented to evaluate the sensors of a fleet of carrier aircraft from 
Lockheed Martin. The data were collected from a fleet of more than 20 airplanes. Each 
airplane has used 57 sensors in health monitoring. A specific type of fault was 
observed on some airplanes and historic data were labeled as faulty and healthy 
according to the maintenance findings and engineering judgment. It is assumed some 
of the 57 sensors are sensitive to the fault of interest and some are not. The task is to 
identify the fault sensitive sensors, which are the useful features in this research. The 
original labeled data have several millions of observations. They were down-sampled 
and a sample of 648 observations was randomly selected for evaluation. 
 
 56 
 
WPNE (3, 7)
 
The result is shown in Table 4. CTC selected only one features but it achieved the 
highest mean accuracy. This result is practical that an interpretable rule can be setup 
with this feature using decision tree. 
 
Table 5: Results of the Field Data 
Method CTC RF SS RFE SVM LR
# Features 1 3 48 25 47 44
Mean Accuracy 87.5% 70.1% 65.4% 66.3% 64.5% 65.1%
std Accuracy 0.025 0.060 0.031 0.051 0.006 0.031  
 
An excerpt of the correlation tree is shown in Figure 20. Several groups of correlated 
features are identified. However, the selected sensors are not correlated with any other 
sensors. Therefore, there is a potential to add redundant sensors for the useful sensor. 
Tree cut threshold
Correlation distance
 
Figure 20: An Excerpt of the Correlation Tree for the Aircraft Data 
 
 57 
 
Feature Index
 
3.3 Summary 
Correlation tree cut (CTC) feature selection method was developed in this dissertation. 
It can select useful features under indeterminate operating conditions even when some 
features are correlated. In the simulation study, the CTC method was not affected by 
the number of noise and correlated features, and its overall performance is superior 
than that of benchmark methods. In the experimental study, under indeterminate 
operating conditions, the method provides the highest detection accuracy with the 
smallest standard deviation. In the field study, it chose a single feature and achieved 
the highest accuracy. The feature correlation tree constructed in the feature selection 
method provides an approach to identify redundancy among features. One feature 
from each cluster is adequate for modeling. Remaining features can be used for needed 
redundancy. 
 
CTC feature selection is performed without the requirement on domain knowledge of 
the system. However, the initial set of features to be selected by CTC can benefit from 
the use of domain knowledge. The initial set of features should include fault sensitive 
features, and otherwise the result of CTC will be useless. To ensure fault sensitive 
features are included, domain knowledge can be used to generate a large set of features 
that have been proven useful theoretically or experimentally.  
 
 58 
 
 
Chapter 4: Anomaly Detection Using Unimodality-Based 
Clustering 
The focus of this research is semi-supervised anomaly detection. That is, only healthy 
training data are used and the training data from anomalies are not required. This focus 
is based on the situation of machinery data: the health states of the data are usually 
labeled using domain knowledge based on maintenance activities. If no fault is found, 
the data collected between two adjacent maintenances are labeled as healthy. If a fault 
is found, the data collected before the maintenance for a certain period are regarded as 
faulty, and the data collected for a period after the corrective maintenance are labeled 
as healthy. The periods before and after the maintenance for the determination of the 
faulty and healthy data are usually judged by engineering experience and thus 
uncertainties are introduced. To avoid increasing the risk of including faulty data to 
the set of healthy data, healthy data can be selected only from the period between two 
adjacent maintenances that no fault is found. Although this practice abandoned the 
data collected before the maintenances where faults are identified, healthy data are 
usually abundant to afford the abandoning of the uncertain data. However, the 
uncertainty involved in labeling faulty data cannot be avoided in the same way. Even 
when two adjacent maintenances identified the same fault, the data between them are 
still mixed with both healthy and faulty data. Moreover, a machinery system involves 
multiple failure modes, and some of them only occur once during the whole life cycle 
of the system. As a result, it is unlikely to get a data set covering a full range of failure 
modes for training. In sum, healthy training data are widely available and faulty 
training data are not, and thus using semi-supervised anomaly detection is practical. 
 59 
 
 
 
One-class SVM and semi-supervised KNN have been applied as benchmark methods 
in semi-supervised anomaly detection due to their wide range of application and 
satisfactory results. However, these methods are sensitive to the choice of hyper-
parameters that it is difficult to reach a trade-off between specificity and generality 
without the supervision of labeled anomalous data, which are usually not available at 
the stage of anomaly detection.  
 
Clustering-based methods is semi-supervised and they do not have the problem of one-
class SVM and KNN. They model the data under a state of operating conditions as a 
cluster. Each cluster is a group of observations generated by a state of operating 
conditions when the system is healthy. Thus, a cluster is a generalization of data under 
a state of operating conditions. The trade-off between specificity and generality is 
reached by identifying the data associated with different states of operating conditions 
and by generalizing the data for a given state of operating conditions. After clustering, 
a test observation is detected as an anomaly if it is not a member of any cluster. A 
general procedure of clustering-based anomaly detection is illustrated in Figure 21. 
 60 
 
 
Phase 1: Clustering the data Phase 2: If a test observation is not a
member of any cluster, it is an anomaly.
5.5 5.5
Anomaly boundaries
5 5
4.5 4.5
*
4 4
3.5 3.5
3 3
2.5 2.5
0.005 0.01 D0.i0m15 1 0.02 0.025 0.005 0.01 D0.i0m15 1 0.02 0.025
 
Figure 21: Procedure of Clustering-Based Anomaly Detection 
 
4.1 Development of Unimodality-Based Clustering 
Among various clustering approaches, partition-based clustering and hierarchy-based 
clustering are widely used in the condition monitoring of machinery systems. 
Representative partition-based clustering algorithms include the k-means clustering 
family, and representative hierarchy-based clustering algorithms include the linkage 
clustering family. Both approaches require that the number of clusters is known in 
advance either directly or indirectly. Determining the number of clusters gives rise to 
the research on clustering validation, where the clustering, given a certain choice of 
the number of clusters, is evaluated to determine if the clustering is successful under 
validation criteria[91]. 
 
Established methods for determining the number of clusters include the empirical 
approach, the cluster similarity-based approach, and the distribution-based approach. 
The empirical approach determines the number of clusters either using domain 
 61 
 
Dim 2
Dim 2
 
knowledge about the data [92] or the elbow method, which tries to make a trade-off 
between the number of clusters and the variance explained from clustering. User 
experience is required for every specific clustering, and determining the threshold for 
the elbow method also requires prior understanding of the data although the threshold 
often cannot be identified [93]. Thus, the empirical approach is not suitable for 
automated clustering without user interference. The issue of elbow method is 
illustrated in Figure 22, where Iris data were used. The Iris data have 3 classes. When 
the data are clustered without using the labels, 3 clusters should be identified. Elbow 
method found both 2 clusters and 3 clusters look like the elbow. 
Both points look like
the elbow
Number of clusters
 
Figure 22: The Issue of Elbow Method 
 
In the similarity-based approach, the number of clusters is determined by maximizing 
the ratio of intra-cluster similarity to the inter-cluster similarity. Representative 
methods include the silhouette index (SI) [94][95] and gap index [96]. SI takes value 
in [-1, 1]. Higher value indicates better result. To use SI, a clustering algorithm such as 
k-means is run by trying different number of clusters. The number of cluster gives the 
largest SI is selected. For a clustering result, SI is calculated as: 
 62 
 
Average within cluster sum 
of squared distance
 
b − a
 SI =  (5) 
max(a,b)
where a is the mean of intro-cluster distance, and b is the mean of the distances of 
points to its nearest-clusters. 
 
Similarity-based approach does not use an application-specific definition of clusters 
and the clustering result is often not valid. For example, SI is liable to choose a larger 
number of clusters, and gap statistics may not choose the optimum number of clusters 
if different dimensions have different scales. I 
 
In the distribution-based approach, every cluster is assumed to be sampled from a 
parametric distribution, such as a Gaussian distribution [97] or a mixture of parametric 
models [98]. Distribution-based clustering methods give an unambiguous definition of 
each cluster by describing it with a distribution model. However, the data from a 
cluster often cannot be described by a known parametric distribution.  
 
Distributions commonly observed in the health monitoring of machinery systems 
include normal, lognormal, Rice, Rayleigh, Nakagami, Student’s t, and truncated 
versions of these distributions  [99]. The common characteristic of these distributions 
is that they are unimodal. Based on this idea and the distribution-based approach, in 
this dissertation a cluster is defined as a group of observations following a unimodal 
distribution in every dimension. Thus, if a dataset has M modes, it can be partitioned 
 63 
 
 
into M clusters of unimodal data. A unimodality-based clustering method was 
developed accordingly in this dissertation: different clustering partitions are generated, 
and the one validated by the unimodality test is the optimal partition. Compared with 
normality-based methods, the unimodality-based clustering method has a wider range 
of applications by considering all unimodal distributions. Based on unimodality-based 
clustering, an anomaly detection method is developed. 
 
A unimodal distribution is a probability distribution that has a single mode, which 
means a single value appears most frequently without a local maximum. For the 
probability density function (pdf) f(x) of a unimodal distribution with mode x = m, f(x) 
is monotonically increasing for x ≤ m, and monotonically decreasing for x ≥ m. 
Intuitively, the pdf of a unimodal distribution has only one peak. However, due to the 
disturbance of noise, the pdf of the actual unimodal data may not strictly have one 
peak, which challenges the validation of unimodality. 
 
In this research, a cluster is defined as a group of observations that follow a unimodal 
distribution in every dimension. This definition addresses the challenges of 
distribution-based clustering by treating the clusters as following a general form of 
unimodal distribution, which includes the case where the data cannot be described by a 
parametric unimodal distribution and the case where the data consists of clusters from 
different types of unimodal distributions.  
 
 64 
 
 
If a set of multimodal data is correctly partitioned, the data in each cluster follow a 
unimodal distribution in every dimension, as illustrated in Figure 23: the data were 
simulated to have two modes. There would be at least one dimension where the 
original data are not unimodal, and when the data are partitioned to 2 clusters, the data 
in a cluster are unimodal, as shown in dimension 1. 
The data in one The data are not 
cluster are unimodal. unimodal.  
Figure 23: Unimodality and Clusters 
 
To develop a unimodality validation method, the properties of unimodal distributions 
are investigated. The cumulative distribution function (cdf) has the following 
properties: first, the cdf is non-decreasing. Second, the cdf converges to 0 and 1. Third, 
the cdf of a unimodal distribution has only one inflection point. The first two 
properties hold for any distribution, and the third property is unique for unimodal 
distributions. The cdf satisfies these properties in the shape of a sigmoid function 
bounded between 0 and 1. In general, for a data with m modes, the number of 
inflection points is 2m – 1. For example, for a bimodal distribution, m = 2, and there 
 65 
 
 
are 3 inflection points. For a unimodal distribution, m = 1, and there is one inflection 
point. 
 
If a dataset is unimodal, its empirical distribution converges to its underlying unimodal 
distribution. In general, if the empirical distribution of data x with n observations is 
Fn(x), and the underlying distribution is F(x), according to the law of large numbers, 
Fn(x) converges to F(x), as described in (6) and (7). 
n
 Fn (x) = Pn (X  x) =
1 I (X  n i  x) (6) 
i=1
 sup | Fn (x)−F(x) |→ 0  (7) 
x
where I (y) = 1 for y = True, and 0 for y = False. n is the number of observations. 
 
When the data are sampled from a unimodal distribution G(x), the underlying 
distribution F(x) =  G(x): 
 d = sup | Fn (x) −G(x) |→ 0 (8) 
x  
The supremum of the absolute difference d between the empirical distribution Fn(x) 
and the G(x) is the Kolmogorov–Smirnov (KS) statistic. d should converge to 0 if the 
null hypothesis is true that G(x) is the underlying distribution. Otherwise, d is positive. 
A smaller d indicates the data are more likely being sampled from G(x). Under the null 
hypothesis, nd  follows KS distribution, and therefore the KS test can be used to 
perform the hypothesis test.  
 66 
 
 
 
Based on the idea of the KS test, Hartigan’s dip test [100] was developed to test 
unimodality. The dip test employs an empirical procedure to find a unimodal 
distribution that is most similar to the empirical distribution of the data by minimizing 
the KS statistic d. Then a hypothesis test is performed to determine if the data are 
sampled from the identified unimodal distribution.  
 
The dip statistic was defined as the maximum difference in terms of KS distance 
between the empirical distribution function and the unimodal distribution function that 
minimizes that maximum difference. The dip statistic of a distribution function F is as 
defined in (9) 
 r(F) = (F ,)  (9) 
where r(F) is the dip statistic of F. Ω is the class of unimodal distributions. For any 
bounded functions F and G, ρ(F, G) is defined in (10). For any class Λ of bounded 
functions, ρ(F, Λ) is defined in (11) 
 (F ,G) = sup x F(x) −G(x)  (10) 
 (F,) = inf (11) G (F,G)  
In the unimodality test, F is replaced by the empirical distribution function of the data 
under test. Equation (5) is the calculation of the KS distance of the empirical 
distribution to a unimodal distribution G. The KS distance ρ(F, G) is the largest 
absolute difference of F and G, as shown in Figure 24. 
 67 
 
 
1  
KS distance
0.8 Unimodal
distribution G(x)
0.6
Empirical
0.4
distribution Fn(x)
0.2
0 
2 4 6 8 10
DataD avtaalue
 
Figure 24: Dip Statistic 
 
In Equation (6), G is the unimodal distribution that results in the smallest dip statistic 
of F. To find G, the properties of unimodal distributions are investigated. As a 
cumulative distribution function (cdf), G is non-decreasing and it converges to 0, and 1, 
as shown in Figure 25. G has an additional property: it has only one inflection point. 
Therefore, G has the shape of a sigmoid function. 
 
Data
 
Figure 25: The Shape of Unimodal Distributions 
 
It is not feasible to search all unimodal distributions for G in (6), and thus Hartigan 
and Hartigan [100] developed an estimate of G as a curve in the shape of a sigmoid 
 68 
 
cdf
cdf
Cumulative probability
 
function. The sigmoid curve minimizes the KS distance between this curve, and the 
empirical distribution of the data is determined as the estimate of G. 
 
A smaller dip statistic of a test dataset indicates the data are closer to a unimodal 
distribution. The dip statistic changes with the sample size of the data under test. To 
evaluate the unimodality of datasets with different sizes, the significance of the dip 
statistic is used, which is invariant with the sample size. It measures the chance that 
the dataset has a smaller dip statistic than the reference data of the same size from a 
uniform distribution. A uniform distribution is an extreme case of unimodal 
distribution; it is between unimodal and multimodal. Hartigan and Hartigan [100] 
proved that the dip statistic is stochastically larger for the uniform distribution than for 
any other unimodal distributions. If the test dataset has a dip statistic smaller than that 
of uniform distribution with a probability p, the test data are regarded as unimodal 
with probability p, which is the significance. For example, if the significance p = 0.5, 
it means the test dataset has a 50% chance that its dip statistic is smaller than that of 
the samples from a uniform distribution. By setting up a threshold of the significance, 
the unimodality of the clusters can be evaluated. 
 
The unimodality test using the dip statistic significance is integrated with a clustering 
algorithm to construct a clustering method that automatically estimates the number of 
clusters. K-means clustering, due to its wide application range and the large amount of 
variant algorithms [101], is integrated as a demonstration. Using other algorithms in 
the unimodality-based clustering procedure is similar. The choice of clustering 
 69 
 
 
algorithms depends on the application. For example, if the data are nonlinearly 
correlated or membership scores are needed, kernel fuzzy c-means clustering is a 
better choice, as in [102]. 
 
K-means clustering finds cluster centroids and identifies data observations that belong 
to the clusters of the centroids according to the distances of the observations to the 
centroids. An observation belongs to the cluster for which the observation has the 
shortest distance to the centroid. For a given set of centroids, the clustering criterion F 
is provided by (12). 
k
 F = arg min || x j − i ||
2
 (12) 
S i=1 x jSi
where S = {S1, S2, …, Sk} are k sets of clustered data; (x1, x2, …, xn) are n observations 
to be clustered; and μi is the centroid of Si. 
 
To find the centroids of the clusters, Lloyd’s algorithm is usually used. Centroids for 
the initial clusters are assigned randomly. When the distances between each 
observation and the centroids are calculated, the membership of the observations in the 
clusters is reassigned so that an observation becomes an member of a cluster with the 
nearest centroid. After that, the centroids of the updated clusters are calculated. This 
process is repeated until the membership of the data points in the clusters does not 
change any more. 
 
 70 
 
 
In the unimodality-based clustering developed in this dissertation, k-means is 
implemented to partition the data into a small number of clusters, and then the 
unimodality of the data from each of the k clusters is tested on all dimensions. If the 
data from all the k clusters pass the unimodality test, the number of clusters is 
determined as k. Otherwise, the number of clusters is increased to k + 1, and the 
procedure is repeated. To avoid outliers being identified as a unimodal cluster, the size 
of each cluster is examined. If the cluster size is smaller than a certain number, it is not 
regarded as a valid cluster, and the method would repeat the procedure by using a new 
set of initial centroids for k-means. The procedure for the method is shown in Figure 
26.  
 
Start the number of cluster k = 0
k = k + 1
k0 = k
Partition the data into k0 clusters 
using k-means
Yes
Do the clusters No
satisfy unimodality 
criterion?
Yes
No Does each cluster 
have more than N0
members?
Yes
Clustering completed.
The data has k0 clusters  
Figure 26: Flowchart of Unimodality-Based Clustering 
 
 71 
 
 
4.2 Evaluation of Unimodality-Based Clustering 
Clusters with different degrees of overlaps, dimensionalities, and distributions were 
simulated to evaluate the performance of unimodality-based clustering. It was 
compared with silhouette-based clustering and gap-based clustering due to their 
widely reported effectiveness [91] [95] [103]. In silhouette-based clustering, different 
partitions from k-means clustering are tried and the partition that maximizes the mean 
silhouette coefficient is selected [94]. In the same way, the partition that maximizes 
the gap statistic is selected by gap-based clustering [96]. 
 
Adjusted Rand score was applied to measure the clustering result. Adjusted Rand 
score measures the similarity between true cluster labels and estimated cluster labels. 
Adjusted Rand score is based on Rand score, which is defined in the following 
equation: 
a + b
 R = (13) 
a +b + c + d  
where R is Rand score; a is the number of times a pair of observations belongs to the 
same cluster for both the estimated clustering result and the actual clustering result; b 
is the number of times a pair of observations belongs to different clusters for both the 
estimated clustering result and the actual clustering result; c is the number of times a 
pair of observations belongs to different clusters for the estimated clustering result but 
in the same cluster for the actual clustering result; d is the number of times a pair of 
observations belongs to the same cluster for the estimated clustering result but in 
different clusters for the actual clustering result. 
 72 
 
 
 
Adjusted Rand score (AR) improves Rand score as shown in the following equation to 
suppress the score of random labeling. 
R −E(R)
 AR =  (14) 
max(R)−E(R)
A value of 0 indicates the estimated labels for the test data are randomly assigned, and 
a value of 1 indicates the estimate labels have a perfect match with the true labels [104] 
[105]. Compared with other metrics such as Rand score and V-measure [106], adjusted 
Rand score is not misled by the random guess of labels. In every evaluation, the 
simulation was repeated 10 times, and the mean value of the adjusted Rand score was 
calculated. 
 
In the first evaluation, the clustering performance on the data with different degrees of 
overlaps was evaluated. The simulated data have 4 equally spaced clusters. Each 
cluster had 100 2-dimensional observations sampled from a Gaussian distribution with 
the same standard deviation (σ). The distance between centroids was changed from 
0.5σ to 3σ to simulate different degrees of overlaps, as demonstrated in Figure 27. The 
results are shown in Figure 28. Unimodality-based clustering, silhouette-based 
clustering and gap-based clustering are denoted as UC, SC, and GC, respectively. All 
3 methods use the same procedure that the data are clustered using k-means with the 
number of clusters increasing iteratively. The best partition is selected by UC if all 
clusters satisfy unimodality criterion and by SC and GC if the optimal SI and GI are 
obtained, respectively. 
 73 
 
 
 
All 3 methods had increased scores when the separations among clusters were 
increasing, and they had a similar score. However, when the separations between 
clusters were small, UC had a similar score with SC and a higher score than GC. 
2σ 3σ
(a) (b)
2σ 3σ
Dimension 1 Dimension 1
 
Figure 27: Control Overlaps by Setting the Distance Between Centroids: (a) separate 
the clusters by 2σ; (b) separate the clusters by 3σ 
 
UC
GC
SC
Separation (σ)
 
Figure 28: Clustering Performance of Gaussian Clusters Using Unimodality-Based 
Clustering (UC), Silhouette-Based Clustering (SC), and Gap-Based Clustering (GC) 
 
 74 
 
Dimension 2
Adjusted Rand Score
Dimension 2
 
In the second evaluation, the performance of clustering non-Gaussian data was 
evaluated. The simulated data have 4 clusters. Each cluster had 100 2-dimensional 
observations sampled from a lognormal distribution with the same scale parameter σL. 
The distance between the clusters was controlled by the difference of the cluster 
location parameters, which was set to 5σL in this evaluation. Due to the skewed shape 
of lognormal distributions, the right tail of the distribution from one cluster can 
overlap with a nearby cluster even when the distance between the clusters is large 
enough to separate Gaussian clusters. The value of σ has changed from 0.1 to 0.6 to 
simulate data with different skewness, as demonstrated in Figure 29. The results are 
shown in Figure 30. All 3 methods had decreased scores when σL was increasing, 
where the data from every cluster became more skewed. When σL was smaller than 0.4, 
all 3 methods provided the same result. When σL was increasing from 0.4, UC and SC 
had similar scores, and they were decreasing slower than GC.  
(a) σL = 0.1 (b) σL = 0.4
Dimension 1 Dimension 1
 
Figure 29: Control the Shape of Clusters: (a) set the scale parameter σ to 0.1; (b) set 
the scale parameter σ to 0.4 
 
 75 
 
Dimension 2
Dimension 2
 
SC
UC
GC
σL
 
Figure 30: Clustering Performance of Lognormal Clusters 
 
In the third evaluation, the performance of clustering from the influence of 
dimensionality on Gaussian data was evaluated. Four clusters of Gaussian data were 
simulated, and each cluster had 100 observations. The distance between cluster 
centroids was set to 2σ. The dimensionality was increased from 2 to 30. The results are 
shown in Figure 31. With the increase of dimensionality, the scores of UC and GC 
increased asymptotically, and the score of SC was decreasing. 
GC UC
SC
Dimensionality
 
Figure 31: Clustering Gaussian Data of Different Dimensions 
 
 76 
 
Adjusted Rand Score
Adjusted Rand Score
 
In the fourth evaluation, the performance of clustering from the influence of 
dimensionality on non-Gaussian data was evaluated. Four clusters of lognormal 
clusters were simulated, and each cluster had 100 observations. The difference of 
cluster location parameters was set to 5σL. The scale parameter was set to 0.5. The 
dimensionality was increased from 2 to 30. The results are shown in Figure 32.  When 
the dimensionality was increasing, UC provided a consistent score; SC decreased 
asymptotically; GC did not work when the dimensionality was small, and as the 
dimensionality increased, GC caught up with the performance of UC and SC. 
GC UC
SC
Dimensionality
 
Figure 32: Clustering non-Gaussian Data of Different Dimensions 
 
From the above 4 evaluations, although UC does not provide the highest score in all 
cases, it has the smallest negative influence from cluster overlapping, non-Gaussian 
data, and dimensionality compared with SC and GC. Therefore, UC can work as a 
general-purpose clustering algorithm. Especially, when the properties of the data are 
unknown, UC is the best choice. For specific applications, UC is the best choice for 
high dimensional non-Gaussian data. For Gaussian data and low dimensional non-
Gaussian data, UC does not have advantage over SC or GC. 
 77 
 
Adjusted Rand Score
 
 
4.3 Development of the Anomaly Detection Method Using 
Unimodality-Based Clustering 
An anomaly detection method was developed using unimodality-based clustering, as 
described in Figure 33. At first, the normal reference data are projected to the feature 
space where the anomaly detection will be conducted. Then the normal reference data 
are clustered using unimodality-based clustering. After that, a test data observation is 
projected to the same feature space. An anomaly indicator is calculated by measuring 
the distance between the test data to its nearest cluster. If the test data observation is 
not a member of the nearest cluster, it is identified as an anomaly. The idea can be 
summarized as a hypothesis test. The null hypothesis is that the test observation is a 
member of the clusters, and the alternative hypothesis is that the test observation is not 
a member of the clusters. In the fault detection of machinery systems, the null 
hypothesis corresponds to the hypothesis that the system is healthy, and the alternative 
hypothesis corresponds to the hypothesis that the system is faulty. 
 78 
 
 
2 (a) 2 (b)
1 1
0 0
-1 -1
-2 -2 Clusters
-2 -1 0 1 2 -2 -1 0 1 2
DDiimmeennsisoino 1n x DiDmimeennssiioon n1 x
2 (c) 2 (d)
1 1
0 0
-1 -1
Anomaly 
-2 Test data -2 indicator
-2 -1 0 1 2 -2 -1 0 1 2
DDiimeennsisonio 1n x DiDmimeennssioino 1n x
 
Figure 33: Clustering-Based Anomaly Detection 
(a) project normal reference data to the feature space; (b) partition the normal 
reference data into clusters; (c) project test data to the feature space; and (d) calculate 
the anomaly indicator of the test data. 
 
 79 
 
DiDmimeennssiioonn 2 y DimDimenenssiioonn 2 y
DiDmimeennssiioonn 2 y DimDimenenssiioonn 2 y
 
Normal 
reference data
Unimodality-based 
Test data
clustering
Clustered 
normal data
Calculate the anomaly Calculate the anomaly  
indicator of the normal data indicator of the test data
Determine anomaly 
threshold
Test data anomaly Yes Anomaly 
indicator > threshold? detected
No
The test data 
are normal
 
Figure 34: Anomaly Detection Procedure Using Unimodality-Based Clustering. 
 
Figure 34 explains the details of the anomaly detection method. In the procedure, there 
are 3 major processing steps: unimodality-based clustering, anomaly indicator 
calculation, and anomaly threshold determination. An anomaly indicator is a measure 
of how an observation deviates from the normal reference data. Distance measures are 
widely used anomaly indicators. For example, Mahalanobis distance (MD) has been 
applied as an anomaly indicator to measure the probability of a test observation being 
a member of the normal reference data. If the reference data are non-Gaussian 
unimodal, the MD value loses its original meaning in terms of probability. However, 
because MD takes the reference data covariance into calculation, as in (15), the shape 
 80 
 
 
of the reference data distribution has less influence on MD than on other distance 
measures such as Euclidean distance. 
 dM (x) = (x − )S
−1(x − )T (15) 
 
where dM(x) is the MD of an observation x = (x1, x2, …, xN) to the reference data with 
mean μ = (μ1, μ2, …, μN) and covariance matrix S. N is the dimension of the data.  
 
In the dissertation, the anomaly indicator of a test data observation is the MD of this 
observation to its nearest cluster from the normal reference data, as shown in Figure 35. 
To test if the test data observation is a member of the nearest cluster, a hypothesis test 
based on the anomaly indicator is set up. The null hypothesis is, the anomaly indicator 
value of the test data observation is sampled from the distribution of the anomaly 
indicator of the nearest cluster from the normal reference data. 
Nearest cluster of healthy data
8
6
4
Test point
2
0
-2
-4
-6
-15 -10 -5 0 5 10 15
Dimension 1  
Figure 35: Using MD as Anomaly Indicator 
 
 81 
 
Dimension 2
 
To set up the anomaly threshold for the hypothesis test, the data from the selected 
nearest cluster are sampled to generate two datasets. One dataset is used as normal 
reference, and the other is used as test set. Anomaly indicator values for all the test set 
observations to the reference set are calculated, and then the distribution of these 
anomaly indicator values is estimated. The hypothesis test can then be conducted by 
choosing percentile of the data, which corresponds to the false positive rate selected by 
the user. If the anomaly indicator value of the test data observation is higher than the 
threshold value identified by the significance level, the null hypothesis is rejected and 
thus anomaly is detected. The threshold value is the hyperparameter. Its value can be 
determined by maximizing an anomaly detection performance metric [107] if training 
data for both normal and anomalies are given. A commonly used performance metric 
is anomaly detection accuracy, which is the ratio of the number of test observations 
assigned to the correct labels to the total number of test observations. Because 
anomaly data are usually unavailable, false positive rate is a practical measure to set 
up the threshold. The false positive rate equals to 1-(percentile/100). A larger false 
positive rate means more healthy data would be detected as anomalies by mistake and 
less true anomalies would be missed. In practice, the user can choose the largest 
affordable false positive rate as a start to avoid the risk of false negative detection, 
which is more destructive than false positive detection, and gradually change the 
threshold to reduce the false positive rate as the information of false negative rate is 
obtained during the monitoring. 
 82 
 
 
0.25  
0.2
Threshold: 99.5 percentile
0.15
0.1
0.05
0  
00 5 10 15 20 25 30 35
IndicaDtaotar value
 
Figure 36: Determination of the Anomaly Threshold 
 
In the health monitoring of machinery systems, when operating conditions are 
changing, the data have multiple modes. The unimodality-based clustering partitions 
the data accordingly, and the anomaly detection method is implemented as in Figure 
34. When the operating conditions are stationary, in most cases the healthy data have 
only one mode, and the unimodality-based clustering regards all the healthy data as a 
single cluster, which is the nearest cluster to the test observation. In some cases, the 
healthy data have multiple modes. For example, a healthy system may have more than 
one resonance excited at the same time even when the operation conditions are 
stationary, and thus the health monitoring data have multiple modes. These data are 
partitioned into multiple clusters using unimodality-based clustering, and anomaly 
detection is still performed as in Figure 34. In all the cases, a cluster is a subspace of 
the healthy reference. If a test observation is not within a subspace, it is an anomaly. 
 
 83 
 
Density
Probability Density
 
4.4 Evaluation of the Anomaly Detection Method 
The anomaly detection method was evaluated by simulated data, benchmark data, and 
experimental data.  
4.3.1 Evaluation of the Anomaly Detection Method Using Simulated Data 
Normal reference data of 3 clusters were simulated. Each cluster consisted of 100 2-
dimensional observations from a Gaussian distribution. Using unimodality-based 
clustering, labels were assigned to the observations of the 3 clusters, as shown in 
Figure 37 (a). After calculating the anomaly indicator values, a lognormal distribution 
was fit to the anomaly indicator values of the normal data, and a significance level of 
0.05 was used to set up the threshold. 
 
Three sets of test data were simulated, and each set had 10 observations, as shown in 
Figure 37 (b). One of them consisted of normal data. The rest of the datasets were 
generated by not using the distributions of the normal clusters, and they were used as 
anomalies. These anomalies were generated to have overlaps with the normal 
reference data. By comparing the anomaly indicator value of the test data to the 
threshold, 19 of the 20 actual anomalies were detected, and 1 of the 10 actual normal 
observations was mistakenly detected as anomalies. The anomaly detection accuracy 
was 93.3%.  
 84 
 
 
Test data (anomalies)
30 30
(a) (b)
25 Cluster 3 25
20 Cluster 2 20
15 15
10 10
5 5
0 0
Test data 
-5 -5
Cluster 1 (normal)
-10 -10
-20 -10 0 10 20 -10 0 10 20 30
Dimension 1 Dimension 1
 
Figure 37: Simulated Normal Reference Data and Test Data 
 
The results were compared with a conventional method without clustering normal 
reference data, and the anomaly indicator value is the MD value of a test data 
observation to the whole normal reference data. The anomaly threshold was set up in 
the same way as in the anomaly detection using unimodality-based clustering.  
 
Using the conventional method, the anomaly detection accuracy is 76.6%, which is 
lower than the 93.3% accuracy from the method developed in this dissertation. The 
confusion matrix of the detection results is shown in Table I. The anomaly detection 
method using the unimodality-based clustering and the conventional method are 
denoted as ADU and CONV, respectively. The conventional method had fewer false 
positive detections. This is because the normal test data were surrounded by the 
normal reference data, the conventional method did not refine the reference, and any 
observation in the surrounded region is regarded as normal. If anomalies exist in the 
surrounded region, they will also be regarded as normal. On the contrary, the method 
developed in this dissertation refines the reference to be clusters that even if anomalies 
 85 
 
Dimension  2
Dimension  2
 
exist in the surrounded regions, they will be detected because they are not members of 
the clusters. 
 
Table 6: Results of Simulated Data Anomaly Detection 
 Detected Normal Detected Abnormal 
ADU/  CONV  ADU/CONV 
True Normal 9/10 1/0 
True Abnormal 1/7 19 /13 
 
4.3.2 Evaluation of the Anomaly Detection Method Using Simulated Data 
Three most widely used benchmark datasets were used for evaluation: Iris data, Wine 
data, and Breast Cancer data. For every benchmark dataset, the data from one class 
was used as anomalies, and the data from other classes were used as normal data. 
During training, class labels were removed, and a portion of the normal data were 
used as reference. Remaining normal data and all the abnormal data were used for 
testing. Evaluation metrics were obtained using 5-fold cross validation. Unimodality 
Clustering Anomaly Detection (UCAD) were compared with benchmark methods.  
 
The Iris dataset has 4 features with a sample size of 150 distributed in 3 classes. The 
data from the third class is designated as anomalies and they were not given in 
training. The data from the remaining two classes are concatenated as the normal data. 
 86 
 
 
Iris data are non-Gaussian. However, anomalies and normal data do not have 
significant overlaps. UCAD has the best result. 
Normal Abnormal
Dim 1
 
Figure 38: Iris Data 
 
Table 7: Results of Iris Data Anomaly Detection 
Iris Data UCAD 1-SVM KNN GMM
Mean accuracy 0.92 0.86 0.80 0.91
STD accuracy 0.07 0.08 0.17 0.07
False negative rate 0.02 0.14 0.08 0.04
False positive rate 0.11 0.14 0.26 0.11  
 
The Wine dataset has more features. It has 13 features with a sample size of 178 
distributed in 3 classes. The data from the first class is designated as anomalies and 
they were not given in training. The data from the remaining two classes are 
concatenated as the normal data. The normal and abnormal data are overlapped. 
UCAD has the highest mean accuracy and the smallest standard deviation. However, 
it has 14% false negative rate. 
 87 
 
Dim 2
 
Normal Abnormal
Dim 1
 
Figure 39: Wine Data 
 
Table 8: Results of Wine Data Anomaly Detection 
Wine Data UCAD 1-SVM KNN GMM
Mean accuracy 0.90 0.83 0.81 0.82
STD accuracy 0.06 0.11 0.09 0.09
False negative rate 0.14 0.22 0.10 0.03
False positive rate 0.08 0.15 0.23 0.25  
 
Breast cancer data have more features and more observations than Iris data and Wine 
data. It has 30 features with a sample size of 569 distributed in 2 classes. The data 
from the second class is designated as anomalies and they were not given in training. 
The data from the first class are designated as the normal data. The data are close to 
be Gaussian, and as a result, besides UCAD, GMM also gets similar results. 
 88 
 
Dim 2
Dim 2
 
Normal Abnormal
Dim 1
 
Figure 40: Breast Cancer Data 
 
Table 9: Results of Breast Cancer Anomaly Detection 
Breast Cancer Data UCAD 1-SVM KNN GMM
Mean accuracy 0.90 0.90 0.83 0.90
STD accuracy 0.05 0.03 0.05 0.04
False negative rate 0.09 0.10 0.03 0.07
False positive rate 0.10 0.10 0.25 0.12  
 
For all the 3 benchmark datasets, UCAD consistently provided the highest mean 
accuracy. It also had the smallest STD accuracy for the Iris data and the Wine data. 
Only in the analysis of the Breast Cancer data, 1-SVM and GMM had smaller STD 
accuracy than UCAD. However, 1-SVM, KNN, and GMM are sensitive to the choice 
of hyper-parameters. To set up their parameters, grid-search was applied where some 
labeled anomaly data were used. UCAD selected an acceptable false positive 
detection rate, such as 5% in the benchmark data analysis, and avoided the 
requirement on the labeled anomaly data. 
 
 89 
 
Dim 2
 
4.3.2 Evaluation of the Anomaly Detection Method Using Experimental 
Data 
Fault detection is a main concern in the maximization of the availability of wind 
turbines [108]. An experiment was conducted on a wind turbine simulator at the 
Korea Institute of Machinery and Materials. A wind turbine gearbox was tested 
because it is the top contributor to wind turbine downtime [109]. Due to the complex 
environment, the operating conditions of wind turbine gearboxes are usually changing 
and may not be monitored. The unimodality-based clustering method can partition the 
data into clusters such that each cluster corresponds to a state of the operating 
conditions, and thus the anomaly detection method developed in this dissertation can 
be implemented. 
 
To implement the anomaly detection method to a machinery system, the health 
monitoring data are usually preprocessed so the information concerning the fault can 
be extracted. In this experimental study, the necessary preprocessing steps are 
introduced, which include the setup of signal acquisition, signal preprocessing, the 
extraction of raw features from the signals, and feature space construction. 
 
The wind turbine simulator is shown in Figure 41. A motor was used to act as the wind 
power that drives the wind turbine. The gearbox connected to the main shaft of the 
wind turbine was tested. The gearbox has three stages. The pinion on the third stage, 
which is connected to the output shaft of the gearbox, was the component under test. 
Five tests were carried out. In the first test, a healthy pinion was installed to generate 
 90 
 
 
healthy data. In the remaining 4 tests, faulty pinions with a 1 mm root crack, pitted 
teeth, worn teeth, and a missing tooth were tested, as shown in Figure 42. The healthy 
pinion was replaced by one of the faulty pinions in turn to generate test data. In each 
test, rotation speed was changed several times and its value was not given in the 
analysis, simulating the situation of unmonitored operating conditions. The objective 
was to check if the method developed in this dissertation can identify the existence of 
different faults under changing operating conditions. 
Motor Flywheel Generator Controller
Gearbox
Main bearing
 
Figure 41: Wind Turbine Simulator 
 
Eight accelerometers were mounted at different locations on the chassis of the gearbox. 
Vibration signal was used because it is available for most wind turbine condition 
monitoring systems [110]. In each test, one accelerometer collected 480 s of data 
under the sampling rate of 25,600 Hz. The sampling rate leads to a Nyquist frequency 
of 12,800 Hz, which is high enough to capture the fault information carried by the 
meshing frequency and the resonance frequencies of the gearbox. 
 91 
 
 
Root crack
Pitting
 
Figure 42: Pinion Faults 
 
Vibration signals from gear meshing systems are modulated. Modulating frequency 
components have been widely used to indicate the gearbox faults. A major carrier of 
the gearbox modulated signal is the meshing frequency. Due to the low speed of the 
gearboxes in wind turbines, meshing frequency usually exists at the same frequency 
band as the low-frequency noise, and thus the modulated signal is contaminated. 
Therefore, wavelet thresholding de-noising was performed in this study. Figure 43 
shows a comparison of the original data and the de-noised data of the first 12 s of 
healthy data. 
 92 
 
 
Original
0.05
0
-0.05
0 2 4 6 8 10 12
Time (S)
Time (s)
0.05
De-noised
0.005
-0.050
0 2 4 6 8 10 12
-0.05 Time (S)
0 2 4 6 8 10 12
Time (S)
Time (s)  
0.05
Fig0ure 43: De-Noising Result 
 -0.05
0 2 4 6 8 10 12
Time (S)
After denoising, features were extracted to give an account of the gearbox health state 
from various aspects. If all 480 s of data were analyzed as one dataset to provide an 
observation of features, the analysis would face two shortcomings: first, one 
observation of features is not adequate to draw a conclusion on the health state of the 
system. Second, the dynamics of the system, such as short-period fluctuations within 
480 s, cannot be captured. Thus, the data were cut into 300 segments, and each 
segment generated one observation of features, so there were 300 observations after 
feature extraction. Every segment had 1.6 s of data, resulting in a frequency resolution 
of 0.625 Hz, which is fine enough to differentiate different frequency components in 
the gearbox signal.  
 
For every signal segment from one of the 8 sensors, 10 widely used features were 
calculated [111]. Six features were in the time domain, including peak-to-peak, root-
 93 
 
Magnitude (g) AMmapgnliittuuMddea ge(gn ()itgud)e (g) AmMapglniittuudde e(g ()g)
 
mean-square (rms), standard deviation, crest factor, skewness, and kurtosis of the 
signal amplitude. Four features were in the frequency domain, including peak 
magnitude and rms of frequency components, peak magnitude and rms of the 
frequency components of the enveloped signal. Altogether, the 8 sensors generated 80 
features. There are more sophisticated wind turbine gear fault features reported in 
different sources [112] [113]. These sophisticated features were not used in this study 
considering they are not available in many existing systems.  
 
After raw feature extraction, the data were grouped into normal reference data and test 
data. The normal data were uniformly sampled without replacement to form two parts: 
the first part consisted of 80% of the normal data and was used as the normal reference 
data; the second part consisted of 20% of the normal data and was used as test data. 
All the faulty data were used as test data because in actual applications labeled faulty 
data are usually not available. 
 
The feature space was spanned using the normal reference data by principal 
component analysis (PCA), which reduces the redundancy and the dimensionality, as 
introduced in [76]. The values of raw features have different scales so they are 
normalized by calculating the Z-score before PCA. The cumulative sum of the 
variance for the first 4 principal components accounts for more than 90% of the total 
variance, and they were used to span the feature space where the anomaly detection 
would be performed. The healthy data were projected to the feature space. Dimension 
1, dimension 2, and dimension 3 correspond to the first 3 principal components. 
 94 
 
 
 
The unimodality-based clustering method developed in this study identified the 
healthy data in the feature space has 4 clusters. This result is consistent with the fact 
that the data were generated under 4 different rotation speeds. The healthy data were 
partitioned accordingly, as shown in Figure 44. 
4 Cluster 1 Cluster 2
2
0
-2
-4 Cluster 4
10
Cluster 3
5 20
10
0 0
-10
-5 -20
 
Figure 44: Clustering Result of the Normal Reference Data 
 
Anomaly indicator values were calculated using the method introduced in Section IV. 
Using the KS test with a significance level of 0.05, the hypothesis that the healthy data 
anomaly indicator values follow a lognormal distribution was accepted. Using a 
significance level of 0.05 on the fitted lognormal distribution for the anomaly 
detection hypothesis testing, the anomaly threshold was set up. Then the test data were 
projected to the feature space for the calculation of their anomaly indicator values, as 
demonstrated in Figure 45, where the data from the pinion with 1 mm root crack were 
plotted. 
 95 
 
Dimension 3
 
Normal reference 
data clusters
50
0
-50
10
5 Test data 15
0 10
-5 5
0
-10
-5
-15 -10
-20 -15
 
Figure 45: Projecting Test Data to the Feature Space 
 
The anomaly detection method using unimodality-based clustering identified all the 
data from the faulty pinions as anomalies. Among the 60 test observations of normal 
data, 2 observations were misidentified as anomalies, resulting in a false positive rate 
of 3.3%. This is due to the choice of anomaly threshold using the significance level of 
0.05, which means the expected false positive rate is 5%. The details of the analysis 
results are listed in Table 10. 
 
Table 10: Results of Experimental Data Anomaly Detection 
Experimental UCAD 1-SVM KNN GMM
Mean accuracy 0.99 0.99 0.95 0.96
STD accuracy 0.02 0.03 0.06 0.04
False negative rate 0.00 0.00 0.01 0.00
False positive rate 0.03 0.03 0.10 0.10  
 
 96 
 
Dimension 3
 
4.4 Summary of Anomaly Detection 
Unimodality-based clustering developed in this research provides consistent result in 
different cases of cluster overlapping, dimensionality, and different types of 
distributions. In the simulation study, the clustering method was not the best in every 
case. However, it provided consistent result and it was one of the best methods in all 
cases. The anomaly detection method addresses the challenge from indeterminate 
operating conditions. In the analysis of simulated data, UCAD shows superiority over 
the conventional method without clustering the healthy reference data. In the analysis 
of benchmark data and experimental data, the anomaly detection method provided one 
of the best results in all cases. Although 1-SVM can also provide comparable results, it 
requires anomaly data for hyper-parameter optimization.  
 
As evaluated in the simulation study the unimodality-based clustering used in UCAD 
has the best performance, relative to other methods, when the healthy training data are 
high-dimensional and the clusters are non-Gaussian. Because machinery condition 
monitoring data are usually high-dimensional and non-Gaussian, UCAD is the best 
choice for machinery anomaly detection. In cases with different dimensions, different 
overlaps of clusters, and different distributions of clusters, the unimodality-based 
clustering does not have significant deterioration of performance but other methods do. 
Therefore, when the knowledge about the data to be processed is incomplete, UCAD 
is the best choice. 
 
 97 
 
 
Since the training of UCAD depends on unimodality-based clustering that k-means 
clustering is performed iteratively, the training process is not efficient in computation 
when the actual number of clusters is large. Therefore, UCAD is not suitable for in-
situ training. Although it was not designed to accept data stream for training, the 
healthy model of UCAD can still be updated periodically when the data labeled as 
healthy become available. Thus, UCAD matches the application scenario of machinery 
maintenance that the labeled healthy data are not continuously flowing data stream but 
data samples collected after maintenance inspections. However, in cases where in-situ 
training is required, UCAD is not suitable.  
 
Chapter 5: Contributions and Future Work 
A feature selection method and an anomaly detection method were developed in this 
dissertation from different aspect of the task of anomaly detection under indeterminate 
operating conditions. It has following contributions. 
 
First, this dissertation developed a CTC feature selection method to identify fault 
sensitive features when some features are correlated, reducing false detection caused 
by improper selection of features. As shown in the evaluation, when the number of 
noise and redundant features increases, CTC consistently selects the smallest number 
of features and achieves the highest mean accuracy. Compared with PCA and 
manifold learning methods, the features selected by CTC maintain the original 
physical meaning. Moreover, the correlation tree and tree-cut method of CTC helps to 
 98 
 
 
discover the redundant features, which is required in safety-critical applications. CTC 
sees a wide range of applications including variable interpretation, model 
simplification, and sensor system improvement.  
 
Second, UCAD is the first semi-supervised method to perform anomaly detection 
under indeterminate operating conditions without anomaly training data, without 
monitoring operating conditions, and without human interference. It expands the 
application range with increased performance of current anomaly detection practices 
and contributes to the automation of PHM. 
 
Third, this dissertation developed a unimodality-based clustering method that provides 
a formal definition of the clusters contained in the machinery health monitoring data 
that clusters are samples of unimodal distributions. It thereby expands the normality 
assumption from one that has been conventionally employed to one that is more 
general. As a result, clustering is feasible for more applications.  
 
Finally, both CTC and UCAD are carried out automatically without human 
intervention. Since setting up feature selection and anomaly detection methods usually 
requires human intervention, CTC and UCAD reduces man hours and makes feature 
selection and anomaly detection easier to implement. Moreover, through this research, 
the industry can benefit from machine learning in the automation of maintenance 
practices. 
 99 
 
 
 
Both methods are designed to work under certain scenarios: CTC feature selection has 
the best performance compared with available methods when some features are 
correlated with each other. It does not have significant advantage if the features are not 
correlated. The application of CTC is also limited by the availability of training data. If 
training data for some classes are missing, CTC cannot be used. UCAD works best 
when the healthy data form unimodal clusters. If the clusters are not unimodal, UCAD 
does not have significant advantage over other clustering-based methods. Although 
UCAD does not require training data from anomalies, it requires a full coverage of 
healthy training data. 
 
Future work includes improving the CTC feature selection method for semi-supervised 
learning, adapting the UCAD for online updating, and the integration of domain 
knowledge. The CTC feature selection method only work in the supervised mode and 
therefore both healthy and faulty training data are required. This requirement may not 
be an issue at any early stage when the experiment can be carried out to generate the 
data, but for systems not affordable to get faulty training data, a semi-supervised CTC 
feature selection is needed. The current version of UCAD trains the model for the 
healthy data only once or periodically updates the model with newly collected healthy 
data. This property is appropriate for machinery condition-based maintenance but does 
not satisfy the needs of applications where in-situ updating of models is required. 
Therefore, a UCAD with in-situ updating capability using data stream should be 
developed in the future work. Both CTC and UCAD work without domain knowledge 
 100 
 
 
of the machinery. They are useful when the domain knowledge is not available. 
However, in real applications usually some domain knowledge is available. For 
example, for a given failure mode, the physics-of-failure model has been established. 
Physics-of-failure model can be used to identify sensitive features. Such domain 
knowledge has been used to narrow the search range of the subset of features. That is, 
a subset of features was selected using domain knowledge, and then a feature selection 
method such as CTC is applied to perform the final selection. This procedure requires 
human intervention and the domain knowledge is not integrated with the data-driven 
method. If the domain knowledge can be integrated automatically with CTC and 
UCAD, the errors in the analysis are expected to be reduced.  
 
During the research of the dissertation, following papers were published and two more 
papers were completed. These papers describe specific aspects related to the 
dissertation and together they depict the roadmap of the research, which is helpful for 
researchers interested in doing research in machinery anomaly detection. 
J. Tian, C. Morillo, M. H. Azarian, and M. Pecht, “Motor bearing fault detection 
using spectral kurtosis-based feature extraction coupled with K-nearest neighbor 
distance analysis,” IEEE Transactions on Industrial Electronics, vol. 63, no. 3, pp. 
1793–1803, 2016. 
J. Tian, M. H. Azarian, M. Pecht, G. Niu, and C. Li, “An ensemble learning-based 
fault diagnosis method for rotating machinery,” in Prognostics and System Health 
Management Conference (PHM-Harbin), 2017, 2017, pp. 1–6. 
 101 
 
 
J. Tian, M. H. Azarian, and M. Pecht, “Rolling element bearing fault detection using 
density-based clustering,” in Prognostics and Health Management (PHM), 2014 
IEEE Conference on, 2014, pp. 1–7. 
J. Tian, M. H. Azarian, and M. Pecht, “Anomaly Detection Using Self-Organizing 
Maps-Based K-Nearest Neighbor Algorithm,” in Proceedings of the European 
Conference of the Prognostics and Health Management Society, 2014. 
J. Tian, C. Morillo, and M. G. Pecht, “Rolling element bearing fault diagnosis using 
simulated annealing optimized spectral kurtosis,” in 2013 IEEE Conference on 
Prognostics and Health Management (PHM), 2013, pp. 1–5. 
 
 
Bibliography 
 
[1] S. Sheng et al., “Wind turbine drivetrain condition monitoring during GRC 
phase 1 and phase 2 testing,” National Renewable Energy Laboratory (NREL), 
Golden, CO., 2011. 
[2] C. Bianchini, F. Immovilli, M. Cocconcelli, R. Rubini, and A. Bellini, “Fault 
detection of linear bearings in brushless AC linear motors by vibration analysis,” 
IEEE Transactions on Industrial Electronics, vol. 58, no. 5, pp. 1684–1694, 2011. 
[3] W. Qiao and D. Lu, “A survey on wind turbine condition monitoring and fault 
diagnosis—Part I: Components and subsystems,” IEEE Transactions on Industrial 
Electronics, vol. 62, no. 10, pp. 6536–6545, 2015. 
 102 
 
 
[4] S. Cheng, M. H. Azarian, and M. G. Pecht, “Sensor systems for prognostics 
and health management,” Sensors, vol. 10, no. 6, pp. 5774–5797, 2010. 
[5] M. Pecht, Prognostics and health management of electronics. Wiley Online 
Library, 2008. 
[6] T. A. Garcia-Calva, D. Morinigo-Sotelo, and R. de Jesus Romero-Troncoso, 
“Non-Uniform Time Resampling for Diagnosing Broken Rotor Bars in Inverter-Fed 
Induction Motors,” IEEE Transactions on Industrial Electronics, vol. 64, no. 3, pp. 
2306–2315, 2017. 
[7] F. Dalvand, A. Kalantar, and M. S. Safizadeh, “A Novel Bearing Condition 
Monitoring Method in Induction Motors Based on Instantaneous Frequency of Motor 
Voltage,” IEEE Transactions on Industrial Electronics, vol. 63, no. 1, pp. 364–376, 
2016. 
[8] S. H. Kia, H. Henao, and G.-A. Capolino, “Gear tooth surface damage fault 
detection using induction machine stator current space vector analysis,” IEEE 
Transactions on Industrial Electronics, vol. 62, no. 3, pp. 1866–1878, 2015. 
[9] Z. Gao, C. Cecati, and S. X. Ding, “A survey of fault diagnosis and fault-
tolerant techniques—Part I: Fault diagnosis with model-based and signal-based 
approaches,” IEEE Transactions on Industrial Electronics, vol. 62, no. 6, pp. 3757–
3767, 2015. 
[10] Z. Gao, C. Cecati, and S. X. Ding, “A survey of fault diagnosis and fault-
tolerant techniques—part II: fault diagnosis with knowledge-based and hybrid/active 
approaches,” IEEE Transactions on Industrial Electronics, vol. 62, no. 6, pp. 3768–
3774, Jun. 2015. 
 103 
 
 
[11] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A survey,” 
ACM computing surveys (CSUR), vol. 41, no. 3, p. 15, 2009. 
[12] L. Zhou, F. Duan, E. Faris, and D. Mba, “Seeded Planetary Bearing Fault in a 
Helicopter Gearbox—A Case Study,” in International Conference Design and 
Modeling of Mechanical Systems, 2017, pp. 495–505. 
[13] Z. Huo, Y. Zhang, P. Francq, L. Shu, and J. Huang, “Incipient fault diagnosis 
of roller bearing using optimized wavelet transform based multi-speed vibration 
signatures,” IEEE Access, vol. 5, pp. 19442–19456, 2017. 
[14] J. Tian, C. Morillo, M. H. Azarian, and M. Pecht, “Motor Bearing Fault 
Detection Using Spectral Kurtosis-Based Feature Extraction Coupled With K-Nearest 
Neighbor Distance Analysis,” IEEE Transactions on Industrial Electronics, vol. 63, 
no. 3, pp. 1793–1803, Mar. 2016. 
[15] Z. Xia, S. Xia, L. Wan, and S. Cai, “Spectral regression based fault feature 
extraction for bearing accelerometer sensor signals,” Sensors, vol. 12, no. 10, pp. 
13694–13719, 2012. 
[16] H. Oh, J. H. Jung, B. C. Jeon, and B. D. Youn, “Scalable and Unsupervised 
Feature Engineering Using Vibration-Imaging and Deep Learning for Rotor System 
Diagnosis,” IEEE Transactions on Industrial Electronics, vol. 65, no. 4, pp. 3539–
3549, 2018. 
[17] I. Guyon and A. Elisseeff, “An introduction to variable and feature selection,” 
Journal of machine learning research, vol. 3, no. Mar, pp. 1157–1182, 2003. 
 104 
 
 
[18] K. Torkkola, “Feature extraction by non-parametric mutual information 
maximization,” Journal of machine learning research, vol. 3, no. Mar, pp. 1415–
1438, 2003. 
[19] R. Battiti, “Using mutual information for selecting features in supervised 
neural net learning,” IEEE Transactions on neural networks, vol. 5, no. 4, pp. 537–
550, 1994. 
[20] H. Peng, F. Long, and C. Ding, “Feature selection based on mutual 
information criteria of max-dependency, max-relevance, and min-redundancy,” IEEE 
Transactions on pattern analysis and machine intelligence, vol. 27, no. 8, pp. 1226–
1238, 2005. 
[21] P. A. Mundra and J. C. Rajapakse, “SVM-RFE with MRMR filter for gene 
selection,” IEEE transactions on nanobioscience, vol. 9, no. 1, pp. 31–37, 2010. 
[22] J. Reunanen, “Overfitting in making comparisons between variable selection 
methods,” Journal of Machine Learning Research, vol. 3, no. Mar, pp. 1371–1382, 
2003. 
[23] S.-W. Lin, Z.-J. Lee, S.-C. Chen, and T.-Y. Tseng, “Parameter determination 
of support vector machine and feature selection using simulated annealing approach,” 
Applied soft computing, vol. 8, no. 4, pp. 1505–1512, 2008. 
[24] J. Yang and V. Honavar, “Feature subset selection using a genetic algorithm,” 
IEEE Intelligent Systems and their Applications, vol. 13, no. 2, pp. 44–49, 1998. 
[25] X. Wang, J. Yang, X. Teng, W. Xia, and R. Jensen, “Feature selection based 
on rough sets and particle swarm optimization,” Pattern recognition letters, vol. 28, 
no. 4, pp. 459–471, 2007. 
 105 
 
 
[26] R. Archibald and G. Fann, “Feature selection and classification of 
hyperspectral images with support vector machines,” IEEE Geoscience and Remote 
Sensing Letters, vol. 4, no. 4, pp. 674–677, 2007. 
[27] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, “Gene selection for cancer 
classification using support vector machines,” Machine learning, vol. 46, no. 1, pp. 
389–422, 2002. 
[28] E. Romero and J. M. Sopena, “Performing feature selection with multilayer 
perceptrons,” IEEE Transactions on Neural Networks, vol. 19, no. 3, pp. 431–441, 
2008. 
[29] A. Painsky and S. Rosset, “Cross-Validated Variable Selection in Tree-Based 
Methods Improves Predictive Performance,” IEEE transactions on pattern analysis 
and machine intelligence, vol. 39, no. 11, pp. 2142–2153, 2017. 
[30] A. Krishnakumari, A. Elayaperumal, M. Saravanan, and C. Arvindan, “Fault 
diagnostics of spur gear using decision tree and fuzzy classifier,” Int J Adv Manuf 
Technol, vol. 89, no. 9–12, pp. 3487–3494, Apr. 2017. 
[31] H. Zou and T. Hastie, “Regularization and variable selection via the elastic 
net,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 
67, no. 2, pp. 301–320, 2005. 
[32] T. Abeel, T. Helleputte, Y. Van de Peer, P. Dupont, and Y. Saeys, “Robust 
biomarker identification for cancer diagnosis with ensemble feature selection 
methods,” Bioinformatics, vol. 26, no. 3, pp. 392–398, 2009. 
 106 
 
 
[33] N. Meinshausen and P. Bühlmann, “Stability selection,” Journal of the Royal 
Statistical Society: Series B (Statistical Methodology), vol. 72, no. 4, pp. 417–473, 
2010. 
[34] M. Belgiu and L. Drăguţ, “Random forest in remote sensing: A review of 
applications and future directions,” ISPRS Journal of Photogrammetry and Remote 
Sensing, vol. 114, pp. 24–31, 2016. 
[35] A. K. Das, S. Goswami, A. Chakrabarti, and B. Chakraborty, “A new hybrid 
feature selection approach using feature association map for supervised and 
unsupervised classification,” Expert Systems with Applications, vol. 88, pp. 81–94, 
2017. 
[36] C. Strobl, A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis, 
“Conditional variable importance for random forests,” BMC bioinformatics, vol. 9, no. 
1, p. 307, 2008. 
[37] L. Toloşi and T. Lengauer, “Classification with correlated features: 
unreliability of feature ranking and solutions,” Bioinformatics, vol. 27, no. 14, pp. 
1986–1994, 2011. 
[38] M. Y. Park, T. Hastie, and R. Tibshirani, “Averaged gene expressions for 
regression,” Biostatistics, vol. 8, no. 2, pp. 212–227, 2006. 
[39] G. Gan and M. K.-P. Ng, “Subspace clustering with automatic feature 
grouping,” Pattern Recognition, vol. 48, no. 11, pp. 3703–3713, 2015. 
[40] R. B. Randall and J. Antoni, “Rolling element bearing diagnostics—A 
tutorial,” Mechanical systems and signal processing, vol. 25, no. 2, pp. 485–520, 
2011. 
 107 
 
 
[41] J. Wang, Y. Peng, and W. Qiao, “Current-aided order tracking of vibration 
signals for bearing fault diagnosis of direct-drive wind turbines,” IEEE Transactions 
on Industrial Electronics, vol. 63, no. 10, pp. 6336–6346, 2016. 
[42] B. L. Song and J. Lee, “Framework of designing an adaptive and multi-regime 
prognostics and health management for wind turbine reliability and efficiency 
improvement,” Framework, vol. 4, no. 2, 2013. 
[43] N. Sammaknejad, B. Huang, and Y. Lu, “Robust Diagnosis of Operating 
Mode Based on Time-Varying Hidden Markov Models,” IEEE Transactions on 
Industrial Electronics, vol. 63, no. 2, pp. 1142–1152, Feb. 2016. 
[44] N. Görnitz, M. M. Kloft, K. Rieck, and U. Brefeld, “Toward supervised 
anomaly detection,” Journal of Artificial Intelligence Research, vol. 46, no. 1, pp. 
235–262, 2013. 
[45] A. Soualhi, G. Clerc, and H. Razik, “Detection and diagnosis of faults in 
induction motor using an improved artificial ant clustering technique,” IEEE 
Transactions on Industrial Electronics, vol. 60, no. 9, pp. 4053–4062, 2013. 
[46] G. Lu, Y. Zhou, C. Lu, and X. Li, “A novel framework of change-point 
detection for machine monitoring,” Mechanical Systems and Signal Processing, vol. 
83, pp. 533–548, 2017. 
[47] J. Tian, M. H. Azarian, and M. Pecht, “Rolling element bearing fault detection 
using density-based clustering,” in Prognostics and Health Management (PHM), 
2014 IEEE Conference on, 2014, pp. 1–7. 
 108 
 
 
[48] R. Laxhammar and G. Falkman, “Online learning and sequential anomaly 
detection in trajectories,” IEEE transactions on pattern analysis and machine 
intelligence, vol. 36, no. 6, pp. 1158–1173, 2014. 
[49] J. Tian, M. H. Azarian, and M. Pecht, “Anomaly Detection Using Self-
Organizing Maps-Based K-Nearest Neighbor Algorithm,” in Proceedings of the 
European Conference of the Prognostics and Health Management Society, 2014. 
[50] G. You, S. Park, and D. Oh, “Diagnosis of Electric Vehicle Batteries Using 
Recurrent Neural Networks,” IEEE Transactions on Industrial Electronics, 2017. 
[51] W. Du, J. Tao, Y. Li, and C. Liu, “Wavelet leaders multifractal features based 
fault diagnosis of rotating mechanism,” Mechanical Systems and Signal Processing, 
vol. 43, no. 1–2, pp. 57–75, 2014. 
[52] Y. Li, M. Xu, R. Wang, and W. Huang, “A fault diagnosis scheme for rolling 
bearing based on local mean decomposition and improved multiscale fuzzy entropy,” 
Journal of Sound and Vibration, vol. 360, pp. 277–299, 2016. 
[53] Y. Li, M. Xu, H. Zhao, and W. Huang, “Hierarchical fuzzy entropy and 
improved support vector machine based binary tree approach for rolling bearing fault 
diagnosis,” Mechanism and Machine Theory, vol. 98, pp. 114–132, 2016. 
[54] M. S. Safizadeh and S. K. Latifi, “Using multi-sensor data fusion for vibration 
fault diagnosis of rolling element bearings by accelerometer and load cell,” 
Information Fusion, vol. 18, pp. 1–8, 2014. 
[55] J. B. Ali, N. Fnaiech, L. Saidi, B. Chebel-Morello, and F. Fnaiech, 
“Application of empirical mode decomposition and artificial neural network for 
 109 
 
 
automatic bearing fault diagnosis based on vibration signals,” Applied Acoustics, vol. 
89, pp. 16–27, 2015. 
[56] M. Gan, C. Wang, and C. Zhu, “Construction of hierarchical diagnosis 
network based on deep learning and its application in the fault pattern recognition of 
rolling element bearings,” Mechanical Systems and Signal Processing, vol. 72–73, pp. 
92–104, 2016. 
[57] X. Jin, M. Zhao, T. W. S. Chow, and M. Pecht, “Motor Bearing Fault 
Diagnosis Using Trace Ratio Linear Discriminant Analysis,” IEEE Transactions on 
Industrial Electronics, vol. 61, no. 5, pp. 2441–2451, May 2014. 
[58] N. Zhang, L. Wu, J. Yang, and Y. Guan, “Naive Bayes Bearing Fault 
Diagnosis Based on Enhanced Independence of Data,” Sensors, vol. 18, no. 2, p. 463, 
2018. 
[59] H. Zhou, J. Chen, G. Dong, and R. Wang, “Detection and diagnosis of bearing 
faults using shift-invariant dictionary learning and hidden Markov model,” 
Mechanical systems and signal processing, vol. 72, pp. 65–79, 2016. 
[60] J. Phillips, E. Cripps, J. W. Lau, and M. R. Hodkiewicz, “Classifying 
machinery condition using oil samples and binary logistic regression,” Mechanical 
Systems and Signal Processing, vol. 60–61, pp. 316–325, Aug. 2015. 
[61] D. H. Pandya, S. H. Upadhyay, and S. P. Harsha, “Fault diagnosis of rolling 
element bearing by using multinomial logistic regression and wavelet packet 
transform,” Soft Computing, vol. 18, no. 2, pp. 255–266, 2014. 
[62] N. E. I. Karabadji, H. Seridi, I. Khelf, N. Azizi, and R. Boulkroune, 
“Improved decision tree construction based on attribute selection and data sampling 
 110 
 
 
for fault diagnosis in rotating machines,” Engineering Applications of Artificial 
Intelligence, vol. 35, pp. 71–83, Oct. 2014. 
[63] Z. Wang, Q. Zhang, J. Xiong, M. Xiao, G. Sun, and J. He, “Fault Diagnosis of 
a Rolling Bearing Using Wavelet Packet Denoising and Random Forests,” IEEE 
Sensors Journal, vol. 17, no. 17, pp. 5581–5588, Sep. 2017. 
[64] M. Cerrada, G. Zurita, D. Cabrera, R.-V. Sánchez, M. Artés, and C. Li, “Fault 
diagnosis in spur gears based on genetic algorithm and random forest,” Mechanical 
Systems and Signal Processing, vol. 70–71, pp. 87–103, 2016. 
[65] Y. Lei, M. J. Zuo, Z. He, and Y. Zi, “A multidimensional hybrid intelligent 
method for gear fault diagnosis,” Expert Systems with Applications, vol. 37, no. 2, pp. 
1419–1430, Mar. 2010. 
[66] J. Tian, M. H. Azarian, M. Pecht, G. Niu, and C. Li, “An ensemble learning-
based fault diagnosis method for rotating machinery,” in Prognostics and System 
Health Management Conference (PHM-Harbin), 2017, 2017, pp. 1–6. 
[67] R. Domingues, M. Filippone, P. Michiardi, and J. Zouaoui, “A comparative 
evaluation of outlier detection algorithms: Experiments and analyses,” Pattern 
Recognition, vol. 74, pp. 406–421, 2018. 
[68] S. Fu, K. Liu, Y. Xu, and Y. Liu, “Rolling Bearing Diagnosing Method Based 
on Time Domain Analysis and Adaptive Fuzzy -Means Clustering,” Shock and 
Vibration, 2016. [Online]. Available: 
https://www.hindawi.com/journals/sv/2016/9412787/abs/. [Accessed: 13-Mar-2018]. 
[69] A. Rodríguez Ramos, O. Llanes-Santiago, J. M. Bernal de Lázaro, C. Cruz 
Corona, A. J. Silva Neto, and J. L. Verdegay Galdeano, “A novel fault diagnosis 
 111 
 
 
scheme applying fuzzy clustering algorithms,” Applied Soft Computing, vol. 58, pp. 
605–619, Sep. 2017. 
[70] A. Soualhi, G. Clerc, and H. Razik, “Detection and Diagnosis of Faults in 
Induction Motor Using an Improved Artificial Ant Clustering Technique,” IEEE 
Transactions on Industrial Electronics, vol. 60, no. 9, pp. 4053–4062, Sep. 2013. 
[71] Z. Wei, Y. Wang, S. He, and J. Bao, “A novel intelligent method for bearing 
fault diagnosis based on affinity propagation clustering and adaptive feature 
selection,” Knowledge-Based Systems, vol. 116, pp. 1–12, Jan. 2017. 
[72] J. Li, R. Zhao, and L. Deng, “Application of EWT AR model and FCM 
clustering in rolling bearing fault diagnosis,” in 2017 Prognostics and System Health 
Management Conference (PHM-Harbin), 2017, pp. 1–6. 
[73] J. Tian, M. H. Azarian, and M. Pecht, “Rolling element bearing fault detection 
using density-based clustering,” in 2014 International Conference on Prognostics and 
Health Management, 2014, pp. 1–7. 
[74] C. T. Yiakopoulos, K. C. Gryllias, and I. A. Antoniadis, “Rolling element 
bearing fault detection in industrial environments based on a K-means clustering 
approach,” Expert Systems with Applications, vol. 38, no. 3, pp. 2888–2911, Mar. 
2011. 
[75] M. Inacio, A. Lemos, and W. Caminhas, “Fault diagnosis with evolving fuzzy 
classifier based on clustering algorithm and drift detection,” Mathematical Problems 
in Engineering, vol. 2015, 2015. 
[76] J. Tian, C. Morillo, M. H. Azarian, and M. Pecht, “Motor bearing fault 
detection using spectral kurtosis-based feature extraction coupled with K-nearest 
 112 
 
 
neighbor distance analysis,” IEEE Transactions on Industrial Electronics, vol. 63, no. 
3, pp. 1793–1803, 2016. 
[77] D. Fernández-Francos, D. Martínez-Rego, O. Fontenla-Romero, and A. 
Alonso-Betanzos, “Automatic bearing fault diagnosis based on one-class ν-SVM,” 
Computers & Industrial Engineering, vol. 64, no. 1, pp. 357–365, Jan. 2013. 
[78] S. Yin, X. Zhu, and C. Jing, “Fault detection based on a robust one class 
support vector machine,” Neurocomputing, vol. 145, pp. 263–268, Dec. 2014. 
[79] D. Martínez-Rego, O. Fontenla-Romero, A. Alonso-Betanzos, and J. C. 
Principe, “Fault detection via recurrence time statistics and one-class classification,” 
Pattern Recognition Letters, vol. 84, pp. 8–14, Dec. 2016. 
[80] J. Yu, “Bearing performance degradation assessment using locality preserving 
projections and Gaussian mixture models,” Mechanical Systems and Signal 
Processing, vol. 25, no. 7, pp. 2573–2588, 2011. 
[81] E. B. Martin and A. J. Morris, “Non-parametric confidence bounds for process 
performance monitoring charts,” Journal of Process Control, vol. 6, no. 6, pp. 349–
358, 1996. 
[82] Y. Pan, J. Chen, and X. Li, “Bearing performance degradation assessment 
based on lifting wavelet packet decomposition and fuzzy c-means,” Mechanical 
Systems and Signal Processing, vol. 24, no. 2, pp. 559–566, 2010. 
[83] R. Huang, L. Xi, X. Li, C. R. Liu, H. Qiu, and J. Lee, “Residual life 
predictions for ball bearings based on self-organizing map and back propagation 
neural network methods,” Mechanical Systems and Signal Processing, vol. 21, no. 1, 
pp. 193–207, 2007. 
 113 
 
 
[84] J. Tian, M. H. Azarian, and M. Pecht, “Anomaly Detection Using Self-
Organizing Maps-Based K-Nearest Neighbor Algorithm,” in Proceedings of the 
European Conference of the Prognostics and Health Management Society, 2014. 
[85] G. Georgoulas, T. Loutas, C. D. Stylios, and V. Kostopoulos, “Bearing fault 
detection based on hybrid ensemble detector and empirical mode decomposition,” 
Mechanical Systems and Signal Processing, vol. 41, no. 1, pp. 510–525, 2013. 
[86] J. Tian, M. H. Azarian, and M. Pecht, “Rolling element bearing fault detection 
using density-based clustering,” in Prognostics and Health Management (PHM), 
2014 IEEE Conference on, 2014, pp. 1–7. 
[87] J. Tian, C. Morillo, and M. G. Pecht, “Rolling element bearing fault diagnosis 
using simulated annealing optimized spectral kurtosis,” in 2013 IEEE Conference on 
Prognostics and Health Management (PHM), 2013, pp. 1–5. 
[88] J. Tian, M. H. Azarian, M. Pecht, G. Niu, and C. Li, “An ensemble learning-
based fault diagnosis method for rotating machinery,” in Prognostics and System 
Health Management Conference (PHM-Harbin), 2017, 2017, pp. 1–6. 
[89] M. Forina, “An extendible package for data exploration, classification and 
correlation,” Institute of Pharmaceutical and Food Analisys and Technologies, Via 
Brigata Salerno, vol. 16147, 1991. 
[90] Z. Xia, S. Xia, L. Wan, and S. Cai, “Spectral regression based fault feature 
extraction for bearing accelerometer sensor signals,” Sensors, vol. 12, no. 10, pp. 
13694–13719, 2012. 
 114 
 
 
[91] O. Arbelaitz, I. Gurrutxaga, J. Muguerza, J. M. Pérez, and I. Perona, “An 
extensive comparative study of cluster validity indices,” Pattern Recognition, vol. 46, 
no. 1, pp. 243–256, 2013. 
[92] J. Yu, “Bearing performance degradation assessment using locality preserving 
projections and Gaussian mixture models,” Mechanical Systems and Signal 
Processing, vol. 25, no. 7, pp. 2573–2588, 2011. 
[93] D. J. Ketchen Jr and C. L. Shook, “The application of cluster analysis in 
strategic management research: an analysis and critique,” Strategic management 
journal, pp. 441–458, 1996. 
[94] P. J. Rousseeuw, “Silhouettes: a graphical aid to the interpretation and 
validation of cluster analysis,” Journal of computational and applied mathematics, 
vol. 20, pp. 53–65, 1987. 
[95] A. Gupta and P. S. Merchant, “Automated Lane Detection by K-means 
Clustering: A Machine Learning Approach,” Electronic Imaging, vol. 2016, no. 14, 
pp. 1–6, 2016. 
[96] R. Tibshirani, G. Walther, and T. Hastie, “Estimating the number of clusters 
in a data set via the gap statistic,” Journal of the Royal Statistical Society: Series B 
(Statistical Methodology), vol. 63, no. 2, pp. 411–423, 2001. 
[97] J. Jacques and C. Preda, “Model-based clustering for multivariate functional 
data,” Computational Statistics & Data Analysis, vol. 71, pp. 92–106, 2014. 
[98] W. Labeeuw and G. Deconinck, “Residential electrical load model based on 
mixture model clustering and Markov models,” IEEE Transactions on Industrial 
Informatics, vol. 9, no. 3, pp. 1561–1569, 2013. 
 115 
 
 
[99] E. Bechhoefer and A. P. Bernhard, “A generalized process for optimal 
threshold setting in HUMS,” in Aerospace Conference, 2007 IEEE, 2007, pp. 1–9. 
[100] J. A. Hartigan and P. M. Hartigan, “The dip test of unimodality,” The Annals 
of Statistics, pp. 70–84, 1985. 
[101] A. K. Jain, “Data clustering: 50 years beyond K-means,” Pattern recognition 
letters, vol. 31, no. 8, pp. 651–666, 2010. 
[102] K. Peng, K. Zhang, B. You, J. Dong, and Z. Wang, “A quality-based 
nonlinear fault diagnosis framework focusing on industrial multimode batch 
processes,” IEEE Transactions on Industrial Electronics, vol. 63, no. 4, pp. 2615–
2624, 2016. 
[103] E. Hancer and D. Karaboga, “A comprehensive survey of traditional, merge-
split and evolutionary approaches proposed for determination of cluster number,” 
Swarm and Evolutionary Computation, vol. 32, pp. 49–67, 2017. 
[104] W. M. Rand, “Objective criteria for the evaluation of clustering methods,” 
Journal of the American Statistical association, vol. 66, no. 336, pp. 846–850, 1971. 
[105] H.-S. Park and C.-H. Jun, “A simple and fast algorithm for K-medoids 
clustering,” Expert systems with applications, vol. 36, no. 2, pp. 3336–3341, 2009. 
[106] A. Rosenberg and J. Hirschberg, “V-Measure: A Conditional Entropy-Based 
External Cluster Evaluation Measure.,” in EMNLP-CoNLL, 2007, vol. 7, pp. 410–420. 
[107] J. Bergstra and Y. Bengio, “Random search for hyper-parameter 
optimization,” Journal of Machine Learning Research, vol. 13, no. Feb, pp. 281–305, 
2012. 
 116 
 
 
[108] S. Simani, S. Farsoni, and P. Castaldi, “Fault diagnosis of a wind turbine 
benchmark via identified fuzzy models,” IEEE Transactions on Industrial Electronics, 
vol. 62, no. 6, pp. 3775–3782, 2015. 
[109] W. Qiao and D. Lu, “A survey on wind turbine condition monitoring and fault 
diagnosis—Part I: Components and subsystems,” IEEE Transactions on Industrial 
Electronics, vol. 62, no. 10, pp. 6536–6545, 2015. 
[110] W. Qiao and D. Lu, “A survey on wind turbine condition monitoring and fault 
diagnosis—Part II: Signals and signal processing methods,” IEEE Transactions on 
Industrial Electronics, vol. 62, no. 10, pp. 6546–6557, 2015. 
[111] Y. Lei, J. Lin, M. J. Zuo, and Z. He, “Condition monitoring and fault 
diagnosis of planetary gearboxes: A review,” Measurement, vol. 48, pp. 292–305, 
2014. 
[112] S. T. Kandukuri, A. Klausen, H. R. Karimi, and K. G. Robbersmyr, “A review 
of diagnostics and prognostics of low-speed machinery towards wind turbine farm-
level health management,” Renewable and Sustainable Energy Reviews, vol. 53, pp. 
697–708, 2016. 
[113] Y. Wang, J. Xiang, R. Markert, and M. Liang, “Spectral kurtosis for fault 
detection, diagnosis and prognostics of rotating machines: A review with 
applications,” Mechanical Systems and Signal Processing, vol. 66, pp. 679–698, 2016. 
 
 
 
 
 117 
 
 
 
 
 
 118