ABSTRACT

Title of Dissertation: Some Guidelines for Risk Assessment of Vulnerability Discovery Processes

Yazdan Movahedi, Doctor of Philosophy, 2019

Dissertation directed by: Associate Professor Michel Cukier, Department of Mechanical Engineering

Software vulnerabilities can be defined as software faults that can be exploited as a result of security attacks. Security researchers have used data from vulnerability databases to study trends in the discovery of new vulnerabilities, to propose models for fitting the discovery times, and to predict when new vulnerabilities may be discovered. Estimating the discovery times of new vulnerabilities is useful both for vendors and for end-users, as it can help with resource allocation strategies over time.

Among the research conducted on vulnerability modeling, only a few studies have tried to provide a guideline about which model should be used in a given situation. In other words, assuming the vulnerability data for a software is given, the research questions are the following: Is there any feature in the vulnerability data that could be used for identifying the most appropriate models for that dataset? What models are more accurate for vulnerability discovery process modeling? Can the total number of publicly-known exploited vulnerabilities be predicted using all vulnerabilities reported for a given software?

To answer these questions, we propose to characterize the vulnerability discovery process using several common software reliability models (SRMs) and vulnerability discovery models (VDMs). We plan to consider different aspects of vulnerability modeling, including curve fitting and prediction. Some existing SRMs/VDMs lack accuracy in the prediction phase. To remedy the situation, three strategies are considered: (1) finding a new approach for analyzing vulnerability data using common models, i.e., examining the effect of data manipulation techniques (clustering, grouping) on vulnerability data and investigating whether they lead to more accurate predictions; (2) developing a new model that has better curve-fitting and prediction capabilities than current models; and (3) developing a new method to predict the total number of publicly-known exploited vulnerabilities using all vulnerabilities reported for a given software.

The dissertation is intended to contribute to the science of software reliability analysis and presents some guidelines for vulnerability risk assessment that could be integrated as part of security tools, such as Security Information and Event Management (SIEM) systems.

SOME GUIDELINES FOR RISK ASSESSMENT OF VULNERABILITY DISCOVERY PROCESSES

by Yazdan Movahedi

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2019

Advisory Committee:
Associate Professor Michel Cukier, Chair
Professor Rance Cleaveland, Dean's Representative
Professor Mohammad Modarres
Professor Jeffrey W. Herrmann
Assistant Professor Katrina Groth

© Copyright by Yazdan Movahedi 2019

Acknowledgements

Over the past four years, I have met many incredible people who have contributed to this significant personal and professional achievement. I would like to thank everyone who has supported me, encouraged me, and helped me along this path. First, I would like to thank my committee members Dr. Mohammad Modarres, Dr. Rance Cleaveland, Dr. Jeffrey Herrmann, and Dr. Katrina Groth for their insightful and valuable feedback.
I am very thankful to my advisor, Dr. Michel Cukier, who gave me the opportunity to study at the University of Maryland. This dissertation would not have been possible without his guidance and support. I also had the great opportunity to collaborate with Dr. Ilir Gashi and his research team at City, University of London over the course of my graduate studies. I would like to express my gratitude to Dr. Gashi, who never withheld his advice and support throughout this research.

On a more personal note, moving to a foreign country was not always easy, but I was lucky to build many great friendships along the way: Mehdi, Sanaz, Ali, Daniel, Elaheh, Roohollah, Parastoo, Peyman, Miead, Fatimah, and Amin.

I would like to thank my family. This journey would not have been possible without your support. You have always been there when I needed you. I admire how supportive you are; you taught me to never give up, and it is thanks to you that I have made it to where I am now.

Table of Contents

Acknowledgements .......... ii
Table of Contents .......... iii
List of Tables .......... ix
List of Figures .......... xi
Chapter 1: Introduction .......... 1
1.1 Background and Motivation .......... 1
1.2 Research Questions and Approaches .......... 2
1.3 Contributions .......... 3
1.4 Dissertation Outline .......... 5
Chapter 2: Literature Review .......... 6
2.1 Vulnerability Databases .......... 6
2.1.1 NVD database .......... 6
2.1.2 CVE database .......... 8
2.1.3 SecurityFocus .......... 9
2.1.4 CXSecurity/WLB2 .......... 9
2.1.5 Exploit database (EDB) .......... 10
2.2 Vulnerability Risk Assessment and Modeling: Software Level .......... 10
2.3 Vulnerability Risk Assessment and Modeling: Vulnerability Level .......... 13
2.3.1 Based on source code .......... 13
2.3.2 Based on vulnerability lifecycle .......... 15
2.3.3 Based on CVSS metrics .......... 16
2.3.4 Based on system calls .......... 16
2.4 Methods of Analysis & Risk Assessment Strategies .......... 17
2.4.1 Cluster-based analysis .......... 17
2.4.2 Machine learning .......... 19
2.4.3 Optimal patch planning .......... 20
2.5 Guidelines for Vulnerability Discovery Models .......... 21
Chapter 3: Datasets and Models .......... 23
3.1 Introduction .......... 23
3.2 Vulnerability Dataset Creation .......... 23
3.3 Vulnerability Dataset Overview .......... 24
3.3.1 Operating systems .......... 25
3.3.2 Web browsers .......... 25
3.4 Datasets Characterization .......... 26
3.5 S-shaped Vulnerability Discovery Models .......... 27
3.5.1 Gamma-based VDM .......... 29
3.5.2 Weibull-based VDM .......... 30
3.5.3 AML VDM .......... 31
3.5.4 Normal-based VDM .......... 31
3.5.5 Younis Folded VDM .......... 32
3.6 Non-S-shaped Vulnerability Discovery Models .......... 33
3.6.1 Power-law Software Reliability Growth Models (SRGM-based) .......... 33
3.6.2 Rescorla Exponential (RE) .......... 34
3.6.3 Rescorla Quadratic (RQ) .......... 35
3.7 Summary .......... 35
Chapter 4: Non Cluster-based Vulnerability Assessment .......... 36
4.1 Introduction .......... 36
4.2 Motivation .......... 36
4.3 Analysis .......... 37
4.3.1 Curve-fitting error indicators .......... 38
4.3.2 Prediction error indicators .......... 39
4.4 Curve-Fitting Results .......... 40
4.4.1 Operating systems .......... 40
4.4.2 Web browsers .......... 43
4.4.3 Summary of estimation results .......... 45
4.5 Prediction Results .......... 45
4.5.1 Operating systems .......... 46
4.5.2 Web browsers .......... 47
4.5.3 Summary of prediction results .......... 49
4.6 Discussion .......... 49
4.7 Limitations .......... 50
4.8 Summary .......... 51
Chapter 5: Clustering .......... 52
5.1 Introduction .......... 52
5.2 Motivation .......... 52
5.3 Data Processing .......... 53
5.4 Clustering Method .......... 55
5.4.1 Operating systems .......... 56
5.4.2 Web browsers .......... 56
5.5 Analysis .......... 57
5.6 Curve-Fitting Results .......... 59
5.6.1 Operating systems .......... 59
5.6.2 Web browsers .......... 61
5.6.3 Summary of Curve-Fitting Results .......... 63
5.7 Prediction Results .......... 63
5.7.1 Operating systems .......... 64
5.7.2 Web browsers .......... 66
5.7.3 Summary of Prediction Results .......... 67
5.8 Discussion .......... 68
5.9 Limitations .......... 69
5.10 Summary .......... 70
Chapter 6: A Comparison of Vulnerabilities' Grouping Strategies .......... 71
6.1 Introduction .......... 71
6.2 Motivation .......... 71
6.3 Grouping Strategy .......... 72
6.4 Analysis .......... 74
6.4.1 Curve-fitting error indicators .......... 75
6.4.2 Prediction error indicators .......... 76
6.5 Curve-Fitting Results .......... 76
6.5.1 Operating systems .......... 76
6.5.2 Web browsers .......... 78
6.5.3 Summary of curve-fitting results .......... 79
6.6 Prediction Results .......... 80
6.6.1 Operating systems .......... 80
6.6.2 Web browsers .......... 81
6.6.3 Summary of prediction results .......... 83
6.7 Discussion .......... 83
6.8 Limitations .......... 85
6.9 Summary .......... 86
Chapter 7: Vulnerability Prediction Capability: A Comparison between Vulnerability Discovery Models and Neural Network Models .......... 88
7.1 Introduction .......... 88
7.2 Motivation .......... 88
7.3 Data Processing .......... 89
7.4 Neural Network Model (NNM) .......... 92
7.5 Analysis .......... 96
7.6 Results .......... 97
7.7 Discussion .......... 101
7.8 Limitations .......... 103
7.9 Summary .......... 104
Chapter 8: Predicting Exploited Vulnerabilities .......... 105
8.1 Introduction .......... 105
8.2 Motivation .......... 105
8.3 Data Processing .......... 107
8.4 Analytical steps of scenario S1 .......... 109
8.4.1 For VDMs .......... 109
8.4.2 For the NNM .......... 111
8.5 Analysis .......... 112
8.6 Results .......... 114
8.6.1 Summary of Results .......... 122
8.7 Discussion .......... 122
8.8 Limitations .......... 123
8.9 Summary .......... 124
Chapter 9: Proposed Future Work and Summary of Completed Work .......... 126
9.1 Introduction .......... 126
9.2 Summary of the research questions and contributions .......... 126
9.2.1 Summary of dissertation and research questions .......... 126
9.2.2 Summary of contributions .......... 127
9.3 Summary of Published Work .......... 129
9.3.1 Published work .......... 129
9.3.2 Additional completed work .......... 130
9.4 Future Work .......... 131
Appendices .......... 132
Appendix A: Clustering Tables .......... 132
Bibliography .......... 139
List of Tables

TABLE 1: NUMBER OF VULNERABILITIES PER SOFTWARE .......... 26
TABLE 2: SKEWNESS VALUES PER SOFTWARE .......... 27
TABLE 3: CURVE FITTING ACCURACY FOR OSS .......... 42
TABLE 4: CURVE FITTING ACCURACY FOR WEB BROWSERS .......... 43
TABLE 5: PREDICTION ACCURACY FOR OSS .......... 46
TABLE 6: PREDICTION ACCURACY FOR WEB BROWSERS .......... 48
TABLE 7: CURVE FITTING ACCURACY FOR WINDOWS .......... 60
TABLE 8: CURVE FITTING ACCURACY FOR MAC .......... 60
TABLE 9: CURVE FITTING ACCURACY FOR IOS .......... 60
TABLE 10: CURVE FITTING ACCURACY FOR LINUX .......... 61
TABLE 11: CURVE FITTING ACCURACY FOR IE .......... 62
TABLE 12: CURVE FITTING ACCURACY FOR SAFARI .......... 62
TABLE 13: CURVE FITTING ACCURACY FOR FIREFOX .......... 62
TABLE 14: CURVE FITTING ACCURACY FOR CHROME .......... 62
TABLE 15: PREDICTION ACCURACY FOR WINDOWS .......... 65
TABLE 16: PREDICTION ACCURACY FOR MAC .......... 65
TABLE 17: PREDICTION ACCURACY FOR IOS .......... 65
TABLE 18: PREDICTION ACCURACY FOR LINUX .......... 65
TABLE 19: PREDICTION ACCURACY FOR IE .......... 66
TABLE 20: PREDICTION ACCURACY FOR SAFARI .......... 66
TABLE 21: PREDICTION ACCURACY FOR FIREFOX .......... 67
TABLE 22: PREDICTION ACCURACY FOR CHROME .......... 67
TABLE 23: MODELING GUIDELINE .......... 69
TABLE 24: PERCENTAGE OF COMMON VULNERABILITIES WITHIN FIREFOX VERSIONS .......... 73
TABLE 25: NUMBER OF VULNERABILITIES PER SOFTWARE .......... 74
TABLE 26: CURVE FITTING ACCURACY FOR OSS (ST.1) .......... 77
TABLE 27: CURVE FITTING ACCURACY FOR OSS (ST.2) .......... 77
TABLE 28: CURVE FITTING ACCURACY FOR WEB BROWSERS (ST.1) .......... 78
TABLE 29: CURVE FITTING ACCURACY FOR WEB BROWSERS (ST.2) .......... 79
TABLE 30: PREDICTION ACCURACY FOR OSS (ST.1) .......... 81
TABLE 31: PREDICTION ACCURACY FOR OSS (ST.2) .......... 81
TABLE 32: PREDICTION ACCURACY FOR WEB BROWSERS (ST.1) .......... 82
TABLE 33: PREDICTION ACCURACY FOR WEB BROWSERS (ST.2) .......... 82
TABLE 34: SUMMARY OF SELECTED MODELS PER DATASET (CURVE-FITTING) .......... 84
TABLE 35: SUMMARY OF SELECTED MODELS PER DATASET (PREDICTION) .......... 84
TABLE 36: NUMBER OF VULNERABILITIES PER SOFTWARE .......... 90
TABLE 37: PREDICTION ACCURACY FOR OSS (VDMS & NNM) .......... 98
TABLE 38: PREDICTION ACCURACY FOR WEB BROWSERS (VDMS & NNM) .......... 99
TABLE 39: NUMBER OF VULNERABILITIES PER SOFTWARE (ALL VS. EXPLOITED) .......... 108
TABLE 40: TTNV MEAN RATIOS PER SOFTWARE .......... 111
TABLE 41: PREDICTION ACCURACY FOR OSS PER SCENARIO (VDMS & NNM) .......... 116
TABLE 42: PREDICTION ACCURACY FOR WEB BROWSERS PER SCENARIO (VDMS & NNM) .......... 118
TABLE 43: NUMBER OF VULNERABILITIES PER OS .......... 132
TABLE 44: NUMBER OF VULNERABILITIES PER TYPE AND OS .......... 132
TABLE 45: NUMBER OF VULNERABILITIES PER TYPE, CLUSTER (WINDOWS) .......... 133
TABLE 46: NUMBER OF VULNERABILITIES PER TYPE, CLUSTER (MAC) .......... 133
TABLE 47: NUMBER OF VULNERABILITIES PER TYPE, CLUSTER (IOS) .......... 134
TABLE 48: NUMBER OF VULNERABILITIES PER TYPE, CLUSTER (LINUX) .......... 134
TABLE 49: CLUSTER COMPOSITION FOR OSS .......... 135
TABLE 50: NUMBER OF VULNERABILITIES PER WEB BROWSER .......... 135
TABLE 51: NUMBER OF VULNERABILITIES PER TYPE AND WEB BROWSER .......... 136
TABLE 52: NUMBER OF VULNERABILITIES PER TYPE, CLUSTER (INTERNET EXPLORER) .......... 136
TABLE 53: NUMBER OF VULNERABILITIES PER TYPE, CLUSTER (SAFARI) .......... 137
TABLE 54: NUMBER OF VULNERABILITIES PER TYPE, CLUSTER (FIREFOX) .......... 137
TABLE 55: NUMBER OF VULNERABILITIES PER TYPE, CLUSTER (CHROME) .......... 138
TABLE 56: CLUSTER COMPOSITION FOR WEB BROWSERS .......... 138

List of Figures

FIGURE 1: OVERALL VIEW OF A DISTRIBUTION WITH (A) NEGATIVE SKEWNESS, (B) APPROXIMATELY ZERO SKEWNESS, AND (C) POSITIVE SKEWNESS .......... 27
FIGURE 2: THREE PHASES FOR S-SHAPED MODELS .......... 29
FIGURE 3: FITTED MODELS FOR OPERATING SYSTEMS .......... 41
FIGURE 4: FITTED MODELS FOR WEB BROWSERS .......... 44
FIGURE 5: NORMALIZED PREDICTION ERROR VALUES FOR THE MODELS (OSS) .......... 47
FIGURE 6: NORMALIZED PREDICTION ERROR VALUES FOR THE MODELS (WEB BROWSERS) .......... 48
FIGURE 7: DIAGRAM OF THE PRESENTED CLUSTERING APPROACH .......... 54
FIGURE 8: HISTOGRAM OF THE NUMBER OF DETECTED VULNERABILITIES PER 30 DAYS TOGETHER WITH ITS 180-DAY MOVING AVERAGE FOR THE STUDIED OSS .......... 91
FIGURE 9: HISTOGRAM OF THE NUMBER OF DETECTED VULNERABILITIES PER 30 DAYS TOGETHER WITH ITS 180-DAY MOVING AVERAGE FOR THE STUDIED WEB BROWSERS .......... 91
FIGURE 10: THE NNM ARCHITECTURE USED FOR OUR STUDY .......... 93
FIGURE 11: PREDICTION ERRORS FOR OSS. THE X-AXIS INDICATES TIME (YEAR). THE Y-AXIS REPRESENTS NORMALIZED PREDICTION ERROR VALUES ((Ω_t − Ω)/Ω) .......... 99
FIGURE 12: PREDICTION ERRORS FOR WEB BROWSERS .......... 100
FIGURE 13: PERCENTAGE OF EXPLOITED VULNERABILITIES PER SOFTWARE .......... 109
FIGURE 14: BOX CHART FOR TTNV COEFFICIENT RATIO PER OS (S2/S1) .......... 110
FIGURE 15: BOX CHART FOR TTNV COEFFICIENT RATIO PER WEB BROWSER (S2/S1) .......... 111
FIGURE 16: PREDICTION ERRORS FOR OSS PER SCENARIO .......... 117
FIGURE 17: PREDICTION ERRORS FOR WEB BROWSERS PER SCENARIO .......... 119

Chapter 1: Introduction

1.1 Background and Motivation

Vulnerabilities are software faults that have the potential to be exploited as a result of security attacks [1]. The process of vulnerability risk assessment can be investigated from two different perspectives: (1) based upon the influence vulnerabilities have on software, or (2) based upon the risk associated with a single vulnerability [2]. At the software level, the risk of detecting new vulnerabilities can be assessed by determining the number of vulnerabilities that are going to be detected over time. Even though public vulnerability resources such as Common Vulnerabilities and Exposures (CVE), the National Vulnerability Database (NVD), and the Open Source Vulnerability Database (OSVDB) exist, estimating the total number of vulnerabilities associated with a software is still difficult.

Software reliability models (SRMs) are well known and have been studied for over 40 years [3]. These models have been widely used for assessing the risk associated with vulnerabilities, because vulnerabilities are software faults that can be exploited as the result of security attacks. Research has been conducted to create a link between the fault discovery process and the vulnerability discovery process for modeling purposes [4]. Thus, vulnerability discovery models (VDMs) and SRMs can be considered similar based on their fault detection processes [1].

Researchers have used data from various vulnerability databases to study trends in the discovery of new vulnerabilities and have used various models to predict when new vulnerabilities may be discovered [1], [3], [5]–[8]. Several studies have proposed new SRMs/VDMs or applied existing models to estimate security indicators such as the total number of residual vulnerabilities in the system, the time to next vulnerability (TTNV), the vulnerability detection rate, etc. [1], [3]–[12].

Another approach for analyzing vulnerabilities is to find the risk associated with each vulnerability, to help companies make decisions with respect to the severity level of vulnerabilities and determine priorities for patching vulnerabilities.
It is critical to rank the severity and exploitability of vulnerabilities, since companies use such information to allocate their resources. Overall, estimating the discovery times of new vulnerabilities is useful both for vendors and for end-users, as it can help with their resource allocation strategies over time.

1.2 Research Questions and Approaches

RQ1: Are there any features in the vulnerability data that could be used for identifying the most appropriate models for that dataset? Few studies have tried to provide guidance about which model should be used in a given situation. We address this problem by applying common VDMs to the vulnerabilities associated with eight software products and comparing their prediction accuracy.

RQ2: What models are more accurate for vulnerability discovery process modeling? The lack of prediction accuracy is another issue for reliable vulnerability risk assessment. Two strategies can be considered: (1) finding a new approach for analyzing vulnerability data while still using common VDMs; (2) introducing a new approach with better prediction capabilities. In this research, we examine both strategies:

- We examine the effect of data manipulation techniques (i.e., clustering, grouping) on vulnerability data, and investigate whether this approach leads to more accurate predictions than those obtained without clustering.
- We introduce a new approach using a neural network model that yields more accurate predictions than those from common VDMs.

RQ3: Can the total number of publicly-known exploited vulnerabilities be predicted using all vulnerabilities reported for a given software? Exploited vulnerabilities are vulnerabilities that were exploited as the result of a security attack. They typically form a small portion of all the vulnerabilities reported for a given software (generally 2-7% per software version). A large proportion of the vulnerabilities may never be exploited over the lifetime of a software product. With such a small number of known exploited vulnerabilities compared to the total number of vulnerabilities, it is difficult to mathematically model and predict when a vulnerability with a known exploit will be reported. We study this issue by introducing an approach for predicting the total number of publicly-known exploited vulnerabilities using all publicly-known vulnerabilities reported for a given software.

1.3 Contributions

In this thesis, we present some guidelines to model the vulnerability discovery process of a given vulnerability dataset. We compare the model fitting and prediction capabilities of eight models: one right-skewed distribution model, one flexible-skewed distribution model, three symmetric distribution models, one power-law model, one exponential model, and one quadratic model. We use two different data processing approaches. In the first approach, all the vulnerabilities are considered together, while in the second approach they are first clustered and then modeled. We calculate the accuracy of each model's fitting and prediction capabilities and analyze the average bias of the models (i.e., whether the models were overestimating or underestimating the number of vulnerabilities). We then use the eight VDMs to compare their prediction capabilities with those of a neural network model. We also study the link between the publicly-known disclosure times of exploited vulnerabilities and all publicly-known vulnerabilities reported.
Using this link, we mathematically model and predict when a vulnerability with a known exploit will be reported. We apply the models to eight datasets: four datasets of well-known operating systems (i.e., Windows, Mac, IOS, and Linux) and four datasets of well-known web browsers (i.e., Internet Explorer, Safari, Firefox, and Chrome). Our results showed that, given all the uncertainties associated with our datasets:

- Considering only VDMs, in terms of prediction accuracy, our proposed clustering approach led to more accurate results in 58% of the cases, while the commonly used approach (without clustering) led to more accurate results in 42% of the cases.
- Our new modeling approach using neural networks provides better curve-fitting and prediction capabilities than current VDMs.
- Our proposed approach, which predicts the total number of publicly-known exploited vulnerabilities using all publicly-known vulnerabilities reported for a given software, has better prediction capabilities than current VDMs.

1.4 Dissertation Outline

The rest of the dissertation is organized as follows. Chapter 2 describes the related work. Chapter 3 describes the datasets and models that we used in our research. Chapter 4 applies the models without clustering and compares them in terms of curve-fitting and prediction capabilities. Chapter 5 describes the clustering technique and a descriptive analysis of the vulnerability datasets after clustering; it also presents the curve-fitting and prediction results obtained with the clustered datasets and compares the models' capabilities with and without clustering. Chapter 6 discusses the effect of another data manipulation technique (vulnerability grouping) on the prediction capabilities of the VDMs. Chapter 7 presents a new modeling approach using neural networks and evaluates its prediction capability against the VDMs. Chapter 8 presents a new approach for modeling and predicting the total number of publicly-known exploited vulnerabilities using all publicly-known vulnerabilities reported. Chapter 9 discusses future work, summarizes the work completed, and concludes.

Chapter 2: Literature Review

2.1 Vulnerability Databases

Several publicly available vulnerability databases and security advisories exist, such as the National Vulnerability Database (NVD) and Common Vulnerabilities and Exposures (CVE). They provide vulnerability features including CVE identifiers, severity, Common Vulnerability Scoring System (CVSS) scores, published date, patch date, discovery date, and vulnerability type. However, there is no ideal source for vulnerabilities, since the sources usually overlap and complement each other. As a solution, it is better to use a combination of datasets [5].
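As an illustration only, the sketch below shows one way such a combination could be assembled by joining records on their CVE IDs. The data frames, IDs, and dates are hypothetical and are not taken from this dissertation's actual pipeline (described in Chapter 3).

```python
# A minimal, hypothetical sketch of combining two overlapping vulnerability
# sources keyed on CVE IDs; real NVD/SecurityFocus extracts have more fields.
import pandas as pd

nvd = pd.DataFrame({"cve_id": ["CVE-2014-0001", "CVE-2014-0002"],
                    "published": pd.to_datetime(["2014-01-10", "2014-02-03"])})
sf = pd.DataFrame({"cve_id": ["CVE-2014-0002", "CVE-2014-0003"],
                   "published": pd.to_datetime(["2014-01-28", "2014-03-15"])})

# An outer merge keeps vulnerabilities reported by either source.
merged = nvd.merge(sf, on="cve_id", how="outer", suffixes=("_nvd", "_sf"))

# Where the sources disagree, take the earliest published date as the public
# disclosure time used for discovery-process modeling.
merged["published"] = merged[["published_nvd", "published_sf"]].min(axis=1)
print(merged[["cve_id", "published"]])
```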
2.1.1 NVD database

The National Vulnerability Database (NVD) is a public database and is commonly used for research on vulnerability discovery modeling. By providing official information about previously detected computer vulnerabilities, the NVD helps inform and warn the public about existing vulnerabilities. Since its introduction in 1997, information associated with more than 43,000 software vulnerabilities affecting more than 17,000 software applications has been published by the NVD. This valuable information is a great help in understanding trends and detecting patterns in software vulnerabilities, so that security decision makers can better monitor the security of computer systems affected by ubiquitous software security flaws [13].

However, the NVD comes with some shortcomings, such as chronological inconsistency (there is no unique published date for a given vulnerability among the public repositories), incomplete inclusion (it does not include every detected vulnerability), lack of documentation, and repetitive records of a single discovery event. These issues may derive from the fact that the NVD was not designed with vulnerability modeling needs in mind [5].

Each vulnerability entry in the NVD consists of several fields associated with that vulnerability, including a Common Platform Enumeration (CPE) and a Common Vulnerability Scoring System (CVSS) vector. Based on [13], CPE is defined as "an open framework for communicating the characteristics and impacts of IT vulnerabilities, which provides us with information on a piece of software, including version, edition, language". CVSS is a scoring system designed to provide a standardized mechanism for evaluating the risk associated with vulnerabilities. Communicating the base, temporal, and environmental properties of a vulnerability helps organizations rate the risk associated with that vulnerability. Some components of the CVSS vector are as follows [13]:

- Access Complexity: the difficulty of the attack required to exploit the vulnerability.
- Authentication: indicates whether authentication is required in order to exploit the vulnerability.
- Confidentiality, Integrity, and Availability: these metrics capture three types of loss an attack can cause. Confidentiality loss indicates the condition where information is leaked to people who are not supposed to know it. Integrity loss indicates the condition where illegal modification of data is permitted. Availability loss refers to the situation where the compromised system is not capable of performing its predefined task or has crashed.

The final CVSS score is calculated based upon the above features, with the goal of indicating the severity associated with a vulnerability.
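To make the role of these components concrete, the following hedged sketch computes a CVSS v2 base score. The metric weights and equation come from the public CVSS v2 specification, not from this dissertation, and the function name is ours.

```python
# Hedged sketch of the CVSS v2 base-score equation (weights per the CVSS v2 guide).
AV = {"local": 0.395, "adjacent": 0.646, "network": 1.0}   # Access Vector
AC = {"high": 0.35, "medium": 0.61, "low": 0.71}           # Access Complexity
AU = {"multiple": 0.45, "single": 0.56, "none": 0.704}     # Authentication
CIA = {"none": 0.0, "partial": 0.275, "complete": 0.660}   # C/I/A impact weights

def cvss2_base(av, ac, au, c, i, a):
    impact = 10.41 * (1 - (1 - CIA[c]) * (1 - CIA[i]) * (1 - CIA[a]))
    exploitability = 20 * AV[av] * AC[ac] * AU[au]
    f = 0.0 if impact == 0 else 1.176
    return round((0.6 * impact + 0.4 * exploitability - 1.5) * f, 1)

# A remotely exploitable flaw with complete C/I/A loss scores the maximum severity:
print(cvss2_base("network", "low", "none", "complete", "complete", "complete"))  # 10.0
```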
Each CVE entry in this database includes3: a CVE ID (i.e., "CVE-1999-0067", "CVE-2014-10001"), a brief description of the security vulnerability and at least one pertinent reference (i.e., vulnerability reports and 1 http://cve.mitre.org/ 2 http://cve.mitre.org/about/faqs.html 3 https://cve.mitre.org/about/index.html 8 advisories). All these information are assigned to a vulnerability by a CVE Numbering Authority (CNA). CNAs consist of authorized organizations from around the world eligible to assign CVE IDs to vulnerabilities affecting products within their disclosure, for inclusion in first-time public announcements of new vulnerabilities. 2.1.3 SecurityFocus SecurityFocus as a computer security portal is a home to the well-known Bugtraq mailing list. Based on SecurityFocus’s FAQ4, “BugTraq is a full disclosure moderated mailing list for the detailed discussion and announcement of computer security vulnerabilities: what they are, how to exploit them, and how to fix them”. Each vulnerability entry in this database includes: a Bugtraq ID, a CVE ID, a published date, a brief description of the security vulnerability, and at least one public reference. 2.1.4 CXSecurity/WLB2 World Laboratory of Bugtraq is another collection of information on data communications safety. Based on CXSecurity’s FAQ5, every single user can interact with the database and report a vulnerability. However, each safety note in verified by CXSecurity. Each vulnerability entry in this database includes: a Bugtraq ID, a CVE ID, a published date, a brief description of the security vulnerability, and at least one public reference. Each entry in this database includes: a Bugtraq ID, a CVE ID, a published date, a brief description of the security vulnerability, and at least one public reference. 4 https://www.securityfocus.com/archive/1/description 5 https:// https://cxsecurity.com/wlb/about/ 9 2.1.3 Exploit database (EDB) EDB records exploits and vulnerable software [15]. According to the explanations provided in EDB's official website6, “The Exploit Database is a CVE compliant archive of public exploits and corresponding vulnerable software, developed for use by penetration testers and vulnerability researchers.” It reports vulnerabilities for which there is at least a proof-of-concept exploit. The process of collecting proof- of-concept exploit data is more convenient than collecting data on actual attacks. Based on [16], “a proof-of-concept exploit is merely a byproduct of the so-called ‘responsible vulnerability disclosure’ process, whereby a security researcher that finds a vulnerability discloses it to the vendor alongside a proof-of-concept exploitation code that proves the existence of the vulnerability itself”. Most of EDB’s data are derived from Metasploit, a tool for creating and executing exploit code against a target machine. It provides a search utility that uses a CVE number to find vulnerabilities that have an exploit. In this repository, every vulnerability is assigned an identification code known as EDB Identifier (EDB ID), a CVE ID, an exploit date, and a description. 2.2 Vulnerability Risk Assessment and Modeling: Software Level Even though vulnerability resources such as the common vulnerability exposures (CVE), national vulnerability database (NVD), and open source vulnerability database (OSVDB) are available, estimating the total number of vulnerabilities in a software is still difficult. 
Software reliability models (SRMs) are 6 https://www.exploit-db.com/about-exploit-db 10 well known and has been studied for over 40 years [3]. Several studies have applied SRMs to estimate software security indicators like the total number of residual vulnerabilities, time to next vulnerability (TTNV), and vulnerability density [3], [6]– [8], [17], [18]. The earliest effort at modeling software reliability was a Markov birth-death model introduced by Hudson in 1967 [19]. A good overview of several SRMs that characterize the process of software defect-finding is provided in [3]. Recently, a few vulnerability discovery models (VDMs) have been proposed to estimate the number of total vulnerabilities in a given software/system. Since vulnerabilities are software faults which are exploited as a result of security attacks [1], VDMs and SRMs can be considered to be similar based on the fault detection processes [1]. Research has been conducted to create a link between the fault discovery process and the vulnerability discovery process for modeling purposes [4]. The earliest study on modeling the vulnerability discovery process was conducted in 2002, when Anderson [20] proposed the first VDM termed the Anderson Thermodynamic (AT) model. Since 2002, a few other VDMs have been proposed. Rescorla [6], [7] proposed a VDM to estimate the number of undiscovered vulnerabilities. In 2005, Alhazmi et al. [21] proposed the application of SRMs to vulnerability discover modeling. The same year, they also introduced a logistic VDM known as Alhazmi–Malaiya Logistic (AML) model, which assumes a symmetrical shape around the peak discovery rate value [8]. In another study [21], Alhazmi and Malaiya found that the AML model provides better goodness-of-fit results compared to Rescorla and Anderson models. 11 Moreover, an effort-based model was also introduced that uses system installations instead of calendar time as the independent factor. In other words, the authors argued that discovering a vulnerability associated with a software installed on a larger group of computers is more rewarding. However, the effort-based model requires knowing the number of users for a target product in market share, which is not always easy to obtain. A Weibull distribution-based VDM was proposed by Kim in 2007 [22]. The author argued that the assumption made by the AML model that the vulnerability discovery rate is symmetric around the peak is not always consistent. He leveraged a Weibull distribution to model the asymmetric trend of the discovery rate as an alternative to the AML model. However, the Weibull model did not always provide a good fit. Li et al. [23] empirically showed that, in comparison to other reliability models, a Weibull model is better for estimating defect occurrence across a wide range of software systems. Several studies applied existing models to different types of software packages, such as operating systems and web servers, to simulate the vulnerability discovery rate and predict the number of vulnerabilities that may potentially be present but not yet found [17], [18], [24]. Other studies tried to increase the accuracy of the vulnerability discovery modeling by examining the skewness of the vulnerability data [25]. Research on reliability and risk assessment with a continual vulnerability discovery process has recently started. Studies provided by Anderson [20], Rescorla [6], [7], Alhazmi and Malaiya [8], [18], [21], Kim [26], Ozment and Schechter [27], 12 Ozment [5], Chan et al. 
2.3 Vulnerability Risk Assessment and Modeling: Vulnerability Level

Another approach for analyzing vulnerabilities is to find the risk associated with each vulnerability. Such an approach helps companies make decisions with respect to the severity level of vulnerabilities. Ranking vulnerabilities is a hard task, since it is often difficult to predict how attackers could exploit a vulnerability and use the exploit. Therefore, the risk that different vulnerabilities pose is often unknown until they are exploited. In addition, comparing the severity of vulnerabilities is difficult when relying on their descriptions: a simple programming bug can lead to more harm than a major system flaw. Several studies analyze and model vulnerabilities based on their technical features, such as exploitability, to come up with better estimations. We can divide them into studies focused on the source code of software, the vulnerability lifecycle, CVSS metrics, and system calls.

2.3.1 Based on source code

In addition to vulnerabilities' publication dates, some studies used software source code for vulnerability assessment in the context of VDMs. Kim et al. [22] proposed a VDM based on shared source code measurements among multi-version software systems, using the source code and vulnerability data of two major versions of the Apache HTTP web server and two major versions of the MySQL DBMS. In 2006, Ozment and Schechter applied a reliability growth model to evaluate the security of the OpenBSD OS by examining its source code and the rate at which new code has been introduced [27]. However, it has been shown that source code alone is not an adequately effective measure for prediction [5].

Younis [2] assessed the exploitability of individual vulnerabilities based on source code properties, regardless of the availability or unavailability of a patch. In addition to vulnerabilities' publication dates, he took advantage of software source code for vulnerability analysis in the context of VDMs. Nagappan and Ball [15] performed pre-release defect prediction using relative code churn metrics on Windows Server 2003. Their multiple linear regression model using principal components analysis showed a high correlation between the estimated failures and the actual failures in software modules (r=0.889 for the Pearson correlation and r=0.929 for the Spearman rank correlation). The relationship between code complexity and vulnerabilities of the Mozilla JavaScript engine at the function level was studied by Shin and Williams [29]. The correlations between code complexity and vulnerabilities were weak (Spearman r=0.3 at best) but statistically significant [29]. Shin et al. [30] showed through an empirical study that, by using complexity and code churn metrics, VDMs are capable of predicting vulnerable code locations with a high number of calls during a security breach; however, this approach may generate many false positives. In 2013, Shin [31] showed that the performance of fault and vulnerability prediction is largely affected by the number of reported faults and vulnerabilities in previous releases. Recently, Nguyen et al. proposed an automated method that determines the code evidence for the presence of vulnerabilities in previous software versions in order to evaluate whether the target version is vulnerable [32].
2.3.2 Based on vulnerability lifecycle

Disclosure time, exploitation time, and patching time constitute the lifecycle of a vulnerability. In 2006, a vulnerability lifecycle model was presented by Arbaugh et al. [10] to measure the number of intrusions during the vulnerability lifecycle. Frei et al. [11], [12] linked the patching process to the lifecycle of a vulnerability. They extended Arbaugh et al.'s model using more than 80,000 vulnerabilities; they identified and measured three types of risk exposure: black, gray, and white. They also showed that exploits often appear faster than patches. This work was extended by Shahzad et al. [33], who conducted a descriptive statistical analysis of a large software vulnerability dataset, employing clustering on NVD and OSVDB datasets that included vendors and software.

A risk measure was defined by Joh and Malaiya [34] as a probability of adverse events and their impacts. Using Markovian stochastic models, they utilized the vulnerability lifecycle to measure the likelihood of exploitation for an individual vulnerability and for the whole system. However, the transition rates between vulnerability lifecycle events have not been determined, and the probability distribution of lifecycle events remains to be studied.

The zero-day vulnerability and its lifespan have been defined by McQueen et al. [35]. Based upon their definition, the zero-day lifespan refers to the time between the vulnerability discovery date and the public disclosure date. They were able to identify the actual discovery dates for 15 vulnerabilities. They also compared the CVSS base score to the mean lifespan. Younis [2] considered the time to vulnerability disclosure (TTVD), or lifespan, starting from the vulnerability birth date, and correlated the TTVD with the CVSS base score. Bozorgi et al. [36] declined to predict zero-day vulnerabilities, since they believed that such reports occur with the vulnerability already exploited.

2.3.3 Based on CVSS metrics

Common Vulnerability Scoring System (CVSS) metrics are used to measure the severity of vulnerabilities [2]. Exploitability (the ease of exploiting a vulnerability) and impact (the effect of the exploitation) are indicators of severity. Joh and Malaiya [34] leveraged the impact-related metrics from CVSS to determine the exploitability impact. They applied their metric to assess the risk of two systems with known unpatched vulnerabilities using actual data. Descriptive studies of trends in the scheduling of vulnerability patching and exploitation exist. However, most of them use exploit data from OSVDB, which does not provide sufficient information about the actual exploitation of a vulnerability, and the dates this source reports for exploits are usually not accurate enough [37]. NVD timing data has also been reported to generate an unpredictable amount of noise because of how the vulnerability disclosure process works [16], [37].

2.3.4 Based on system calls

System calls are entry points to privileged kernel operations [2]. A system call from a user function can violate the least privilege principle. The principle of least privilege is a security principle whereby each part of a system has only the privileges needed for its function; in this condition, even if attackers gain access to one part, they have only limited access to the whole system [38]. An analysis of UNIX system calls was presented by Bernaschi et al. [39].
They classified system calls according to their level of threat with respect to system penetration. To control system calls, they proposed the Reference Monitor for UNIX Systems (REMUS), a mechanism to detect intrusions that may use these system calls. Younis [2] applied their idea, utilizing system-related attributes such as attack surface entry points, call function analysis, and the existence of dangerous system calls to measure the exploitability of a known vulnerability.

2.4 Methods of Analysis & Risk Assessment Strategies

Several other strategies exist to provide a better understanding of software risk. One is splitting vulnerabilities based on their specifications and studying their behavior in specific subsets, such as zero-day vulnerabilities. In addition, some studies focused on finding an optimized plan for patch releases with respect to the vulnerability discovery process, in order to diminish the side effects of malicious attacks.

2.4.1 Cluster-based analysis

Clustering is a form of classification that is very useful in understanding the complex nature of multivariable relationships. In other words, clustering is a method of searching data for similarities and dissimilarities in order to find a structure of natural groupings [40]. Clustering can be done for different purposes, including separating real-world exploited vulnerabilities from those that were exploited during software testing [41], and detecting exploited versus non-exploited vulnerabilities when there is not enough information about some vulnerabilities [42]. While clustering is categorized as an explanatory method that simplifies the interpretation of data, in most practical applications the researcher must know enough about the context to distinguish "good" groupings from "bad" ones.

Generally, clustering algorithms are divided into hierarchical methods (connectivity-based clustering, used when the number of items is less than 100), non-hierarchical methods (centroid-based clustering, used for more than 100 items), distribution-based clustering (used when clusters can be defined as items most likely belonging to the same distribution), and density-based clustering (used when clusters are defined as areas of higher density than the remainder of the data set) [40].

Most efforts require a measure of "closeness" or "similarity" to derive a group structure from a complex data set. When data are clustered, the similarity should be indicated by a measure of distance. The most common distance measures are the Euclidean distance (the geometric distance in multidimensional space) and the Mahalanobis distance, which is based on the covariance matrix of the variables [40].
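The short sketch below, on a synthetic feature matrix of our own invention, illustrates how the two measures differ: the Mahalanobis distance rescales by the covariance of the variables, so correlated features are not double-counted.

```python
# Euclidean vs. Mahalanobis distance on synthetic data (illustration only).
import numpy as np

X = np.random.default_rng(0).normal(size=(200, 3))  # 200 items, 3 features
x, y = X[0], X[1]

euclidean = np.linalg.norm(x - y)                   # geometric distance

S_inv = np.linalg.inv(np.cov(X, rowvar=False))      # inverse covariance matrix
mahalanobis = np.sqrt((x - y) @ S_inv @ (x - y))    # covariance-adjusted distance

print(euclidean, mahalanobis)
```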
2.4.2 Machine learning

Machine learning focuses on automatic recognition of complex patterns and making intelligent predictions or decisions based on data. The technique for learning a classifier (or a function) from training examples, where each example is associated with a true label, is called supervised learning, and it is composed of two main phases [45]. The first phase is the learning or training phase: a machine learning algorithm is run on a fraction of the data, consisting of pairs of input data (a vector of integers) and their associated output, to learn a classifier. Testing is the second phase, where the pre-learned classifier is tested on the rest of the data to estimate its precision.
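The following is a minimal sketch of this two-phase workflow with an SVM classifier, in the spirit of [36]; the feature matrix and labels are synthetic, whereas a real study would use features mined from vulnerability reports.

```python
# Minimal sketch (synthetic data; a real study would use features mined
# from vulnerability databases): the two-phase supervised learning
# workflow described above, with an SVM classifier as in [36].
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))             # hypothetical feature vectors
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # hypothetical exploited/not labels

# Phase 1 (training): learn a classifier on a fraction of the labeled data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
clf = SVC(kernel="rbf").fit(X_train, y_train)

# Phase 2 (testing): estimate the classifier's precision on held-out data.
print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")
```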
Bozorgi et al. [36] measured vulnerability severity by analyzing vulnerability exploitability. They discussed the weakness of exploitability measures such as the CVSS base score in providing sufficient information about vulnerability severity. CVSS metrics are the de facto standard used to measure the severity of vulnerabilities [46]. However, CVSS exploitability measures have come under some criticism: the authors [36] argued that CVSS metrics are built only upon expert knowledge and a static formula. To remedy the situation, they proposed a machine learning model using support vector machines (SVMs) and a data mining technique that can predict the possibility of a vulnerability being exploited. In their study, the CVSS exploitability metric assigned a high severity score to many vulnerabilities even though there were no known exploits for those vulnerabilities. This indicates that the CVSS score does not differentiate between exploited and non-exploited vulnerabilities. This result was also confirmed by [16], [47].

Younis et al. [42] leveraged software properties such as the attack surface entry points, the source code structure, and the vulnerabilities' locations to determine the vulnerabilities' exploitability. Younis's approach is particularly important for newly released applications that do not have a large number of historical vulnerabilities. Logistic Regression (LR), Naive Bayes (NB), Random Forests (RF), and Support Vector Machine (SVM) are the machine learning techniques employed in the study. The SVM performed best compared to the other classifiers when principal component analysis (PCA) was used.

Sabottke et al. [41] explored early detection of exploits using information available on Twitter. They proposed the design of a Twitter-based exploit detector using supervised machine learning techniques. They leveraged the exploit-related discourse on Twitter (the tweets that included 'CVE') and extracted information posted on public vulnerability resources to evaluate the chances of early detection of the vulnerabilities at risk of being exploited, in the presence of benign and adversarial noise. In other words, they investigated techniques for minimizing false-positive detections (vulnerabilities that are not actually exploited), which is critical for prioritizing response actions.

2.4.3 Optimal patch planning

A security patch is a small program that fixes vulnerabilities. Patches are usually distributed to end-users to remove those vulnerabilities. Ideally, the best time to release a patch is when a vulnerability appears. However, the development and distribution of patches involves considerable expenses for vendors. Additionally, a poorly designed patch may introduce new issues. Thus, many vendors tend to release patches for their products on a pre-designed timeline [1].

Zheng et al. [48] presented a method for quantifying a security attribute called mean time to security failure (MTTSF) of a virtual machine-based (VM-based) intrusion tolerant system [49] based on queueing theory. They also presented a generalized scheme for tolerating intrusions in a VM-based intrusion tolerant system. In 2016, Luo et al. [50] discussed the patch release strategy from the perspective of vendors using cost criteria. Their model assumed that the number of vulnerabilities discovered follows a non-homogeneous Poisson process (NHPP), and formulated the expected total cost by considering the damage of exploiting vulnerabilities before and after a patch release. Some researchers have focused on proposing methods according to the cost of security breaches [51]; finding a cost function for vulnerabilities has remained a controversial topic.

2.5 Guidelines for Vulnerability Discovery Models

Security decision makers often use public data sources to help make better decisions: for example, which security products to choose, what the current security trends are, and when new vulnerabilities that affect their installations will be publicly reported. Several studies have applied software reliability models (SRMs) and vulnerability discovery models (VDMs) to estimate times between public reports of vulnerabilities [1], [3], [5]–[8].

Few studies have tried to provide a guideline about which model should be used in a given situation. Joh et al. [25] investigated the relationship between the performance of five S-shaped VDMs (i.e., AML, Weibull, Gamma, Normal, and Beta) and the skewness in vulnerability datasets for eight software products. Their results showed that the Gamma-based VDM, which is a right-skewed VDM, always yields better results with positively skewed (right-skewed) datasets than other models in terms of goodness of fit and prediction capabilities. For the other VDMs used, no significant correlation was observed. In addition, the authors showed that the AML model performs better than some right-skewed VDMs in terms of prediction when the vulnerability discovery datasets are asymmetrical.

Massacci et al. [24] proposed an empirical methodology that evaluates the performance of VDMs in terms of goodness of fit and predictability. They evaluated most existing VDMs (AT, Rescorla's models, AML, Weibull, Linear) on 30 major releases of four web browsers (i.e., IE, Firefox, Chrome, Safari). They also classified the age of a browser's version into three different periods: youth (within 6-12 months since the release date), middle age (12-36 months since the release date), and old age (beyond 36 months). Based upon their findings, for a young software, the linear model yielded the best results for estimating the vulnerabilities in the next 3-6 months. For middle-aged browsers, the AML model was selected as the best model.

Regarding the modeling of exploited vulnerabilities, one aspect consists of the probabilistic examination of intrusions [52], [53]. The lack of data is a significant barrier to modeling exploited vulnerabilities using current VDMs or machine learning techniques, which require a considerable amount of data for satisfactory training.

Chapter 3: Datasets and Models

3.1 Introduction

In this chapter, we will introduce the datasets used in this thesis. Then, we will present two groups of VDMs used for our analysis, based upon the classification provided in [24]: S-shaped vulnerability discovery models (VDMs) and non-S-shaped VDMs.
3.2 Vulnerability Dataset Creation

The data used in this research has been collected from six different vulnerability data sources: the National Vulnerability Database (NVD, https://nvd.nist.gov) maintained by NIST, the Common Vulnerabilities and Exposures (CVE) database (https://cve.mitre.org), the CVE Details data source (https://cvedetails.com/), the Security database (https://www.security-database.com/), the SecurityFocus data source (http://www.securityfocus.com/), and the CXSecurity database (https://cxsecurity.com/). These data sources were introduced in Chapter 2. We used the data generated by a security tool called "VepRisk" (http://veprisk.city.ac.uk/main), which has backend modules that mine, extract, and store data from public repositories of vulnerabilities. We then stored the data in our own database using MySQL, and identified each vulnerability by its Common Vulnerabilities and Exposures (CVE) identifier. We used the CVE identifier to compare the reporting date of each vulnerability in the NVD with the dates in the other public repositories of vulnerabilities. We then updated the reporting date in our database to the earliest date that a given vulnerability was known in any of these databases.

We used the NVD as the backbone of the comparisons because it includes all the vulnerabilities that can be found in some of the other data sources. Even though some information might be missing in the NVD, it includes the fields that allow searching for the missing information in the other data sources. One example is the Common Platform Enumeration (CPE) identifier, which is only present in the NVD but can be used in combination with the CVE identifier to extract information from different sources. Another example is the vulnerability type, which is only present in the CVE Details table.

Vulnerability databases might have high uncertainty regarding some variables associated with the reported vulnerabilities, such as published dates. Even though we tried to overcome this issue by collecting vulnerability data from several sources and setting the published dates to the earliest date a vulnerability was reported, we should be aware of the uncertain nature of the reported published dates. Consequently, all the conclusions we draw from this research and their validity are limited by our database uncertainties.
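A minimal sketch of this consolidation step is shown below; the table and column names are hypothetical, and a real pipeline would pull its inputs from the VepRisk backend or the individual databases.

```python
# Minimal sketch (hypothetical table and column names): consolidating the
# reporting date of each CVE across repositories by keeping the earliest
# date at which the vulnerability was known in any source.
import pandas as pd

nvd = pd.DataFrame({"cve_id": ["CVE-2018-0001", "CVE-2018-0002"],
                    "published": ["2018-03-01", "2018-05-10"]})
securityfocus = pd.DataFrame({"cve_id": ["CVE-2018-0001"],
                              "published": ["2018-02-20"]})

all_sources = pd.concat([nvd, securityfocus], ignore_index=True)
all_sources["published"] = pd.to_datetime(all_sources["published"])

# One row per CVE, dated by its earliest report in any repository.
earliest = all_sources.groupby("cve_id", as_index=False)["published"].min()
print(earliest)
```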
3.3 Vulnerability Dataset Overview

It is important to have a high number of vulnerabilities for each of our analyses. Thus, we decided to focus on two groups: operating systems (OSs) and web browsers. More specifically, we selected four OSs (Windows, Mac, IOS, and Linux) and four web browsers (Internet Explorer, Safari, Firefox, and Chrome). The results presented in this thesis include the vulnerabilities reported until the end of 2018.

3.3.1 Operating systems

We focused on vulnerabilities reported for four well-known OSs: Windows (1995-2018), Mac (1997-2018), IOS (the OS associated with Cisco) (1992-2018), and Linux (1994-2018). We chose these OSs as they are among the most widely used and had the highest numbers of vulnerabilities. The start dates indicate the first vulnerability occurrence for the specific OS. For each OS, we included all the vulnerabilities reported for any of its versions. For instance, all the vulnerabilities reported for mac_os, mac_os_server, mac_os_x, and mac_os_x_server were put together to create a vulnerability database for Mac. We did this to have a high number of vulnerabilities for each OS. The total number of distinct vulnerabilities (unique CVE-IDs) for these OSs is 12,852. Table 1 presents the number of vulnerabilities for the four OSs.

3.3.2 Web browsers

We focused on the vulnerabilities reported for four well-known web browsers: Internet Explorer (1997-2018), Safari (2003-2018), Firefox (2003-2018), and Chrome (2008-2018). These browsers were selected since they are widely used and had the highest numbers of vulnerabilities. The start dates indicate the first vulnerability occurrence for the specific web browser. Similar to what we did for the OSs, we considered, for each browser, all the vulnerabilities reported for any of its versions. As an example, all the vulnerabilities reported for ie, ieexplorer, and ie_for_macintosh were combined under Internet Explorer. This allowed us to have a high number of vulnerabilities for each browser. The total number of vulnerabilities for these browsers is 6,546. Table 1 presents the number of vulnerabilities for the four browsers.

Table 1: NUMBER OF VULNERABILITIES PER SOFTWARE

  OS                  Windows   Mac      IOS      Linux
  # Vulnerabilities   3434      2908     698      5812

  Web Browsers        IE        Safari   Firefox  Chrome
  # Vulnerabilities   1862      994      1784     1906

3.4 Datasets Characterization

In this section, we characterize the datasets using well-known statistical indicators. Distributions are characterized by their first four moments, and indicators like skewness and kurtosis highlight distribution properties. We will characterize the datasets based on their skewness, since this indicator is widely used by the vulnerability modeling community [22], [25]. The skewness of a dataset/distribution specifies its degree of asymmetry around its expected value [25]. The skewness values are calculated via the following equation [40]:

\[ \mathrm{Skewness} = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3 \tag{1} \]

where n is the number of data points, x_i is the i-th data value, x̄ represents the mean value, and s is the standard deviation. For a given distribution with an absolute skewness value greater than 1, between 0.5 and 1, and less than 0.5, the distribution is highly skewed, moderately skewed, and approximately symmetric, respectively [40]. Datasets with a positive skewness value whose absolute value is greater than 0.5 are called right-skewed datasets; in such datasets the bulk of the vulnerabilities is reported early in the product lifecycle, with a long tail of reports later, and vice versa. Figure 1 shows the overall shape of the vulnerability discovery process for different skewness values.

Figure 1: Overall view of a distribution with (a) negative skewness, (b) approximately zero skewness, and (c) positive skewness (adapted from H. Joh and Y. K. Malaiya, "Modeling Skewness in Vulnerability Discovery," Qual. Reliab. Eng. Int., vol. 30, no. 8, pp. 1445–1459, Dec. 2014).

Table 2 presents the skewness value of each software used in this research. All the datasets are right-skewed (each has a positive skewness value with an absolute value greater than 0.5). In other words, for a software with a right-skewed dataset, plotting the number of discovered vulnerabilities grouped by time blocks leads to a significant number of vulnerabilities on the left side of the plot.

Table 2: SKEWNESS VALUES PER SOFTWARE

  OS                  Windows   Mac      IOS      Linux
  Skewness            2.381     2.459    3.687    3.589

  Web Browsers        IE        Safari   Firefox  Chrome
  Skewness            2.705     3.381    11.00    1.973
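A minimal sketch of equation (1) follows; the per-interval vulnerability counts are made up for illustration, and the result is checked against SciPy's bias-corrected skewness estimator, which implements the same adjusted formula.

```python
# Minimal sketch: the adjusted sample skewness of equation (1), computed
# directly and checked against SciPy's bias-corrected estimator. The
# per-interval vulnerability counts below are hypothetical.
import numpy as np
from scipy.stats import skew

counts = np.array([42, 35, 28, 20, 14, 9, 6, 4, 3, 2], dtype=float)

n = len(counts)
s = counts.std(ddof=1)   # sample standard deviation
manual = n / ((n - 1) * (n - 2)) * np.sum(((counts - counts.mean()) / s) ** 3)

print(f"equation (1): {manual:.3f}")
print(f"scipy       : {skew(counts, bias=False):.3f}")  # same adjusted formula
```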
3.5 S-shaped Vulnerability Discovery Models

S-shaped VDMs, which measure the total number of detected vulnerabilities, divide the process of vulnerability discovery into three phases, as shown in Figure 2. Phase 1 represents the learning phase, which starts from the introduction of the software and continues until the beginning of the linear phase, as a consequence of the increasing popularity of the software [25]. During the learning phase, the vulnerability discovery intensity function is increasing. Phase 2, or the linear phase, is the period when most of the vulnerabilities are detected; the intensity function associated with the vulnerability discovery process in this phase is constant. Phase 3, or the saturation phase, is the period when most of the vulnerabilities have been discovered and only a few vulnerabilities remain undiscovered [24]. The vulnerability discovery intensity function for the saturation phase is decreasing. Note that the saturation phase might not be observable for all software: this phase will not appear as long as a significant number of vulnerabilities are still undetected.

Figure 2: The three phases for S-shaped models (Phase 1: learning phase, Phase 2: linear phase, Phase 3: saturation phase), shown on a plot of the cumulative number of vulnerabilities over time.

The S-shaped VDMs used in this research are not distributions. They are built upon well-known distributions, and their purpose is to count the total number of vulnerabilities [24]. Accordingly, for all the models used in this study, we only applied their cumulative forms Ω(t) to the vulnerability data. For each software, the variable we are predicting is the cumulative number of vulnerabilities reported in 30-day time intervals. In other words, t is associated with 30-day intervals, and Ω represents the cumulative number of vulnerabilities reported up to each interval.

We use the model skewness to select the S-shaped VDMs in this research. We selected five S-shaped VDMs: one right-skewed model (the Gamma-based VDM), one flexible-skewed model (the Weibull-based VDM), two symmetrical models (the Alhazmi–Malaiya Logistic (AML) model and the Normal distribution-based model), and one folded model (the Younis Folded (YF) VDM). These VDMs were selected because they are the most well-known right-skewed, flexible-skewed, symmetrical, and folded distribution-based VDMs for modeling the vulnerability discovery process. We detail these five models in the following sections.

3.5.1 Gamma-based VDM

The Gamma-based VDM, derived from the Gamma distribution, belongs to the family of right-skewed distributions. It has a continuous intensity function with three parameters: α (shape parameter), β (scale parameter), and γ, which represents the total number of vulnerabilities that would finally be discovered. The vulnerability discovery rate/intensity function ω for the Gamma-based VDM and its cumulative model Ω are presented in (2) and (3), respectively:

\[ \omega(t) = \frac{\gamma}{\Gamma(\alpha)\,\beta^{\alpha}}\, t^{\alpha-1} e^{-t/\beta} \tag{2} \]

where \( \Gamma(\alpha) = \int_{0}^{\infty} t^{\alpha-1} e^{-t}\, dt \)

\[ \Omega(t_0) = \int_{t=0}^{t_0} \frac{\gamma}{\Gamma(\alpha)\,\beta^{\alpha}}\, t^{\alpha-1} e^{-t/\beta}\, dt \tag{3} \]

This distribution is only defined for t > 0. The shape and scale parameters are always positive. It is expected that for software with large values of t, right-skewed distributions provide better fits to vulnerability discovery data than other models [25], because of the gradual reduction in the number of discovered vulnerabilities, which yields a tail on the right side of the vulnerability discovery intensity function.
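Since equation (3) is γ times the Gamma cumulative distribution function, the cumulative model can be written directly with SciPy, as in the following minimal sketch; the parameter values are illustrative only.

```python
# Minimal sketch: the cumulative form of the Gamma-based VDM (equation (3))
# is gamma times the Gamma CDF, so it can be written directly with SciPy.
# The parameter values here are illustrative only.
import numpy as np
from scipy.stats import gamma as gamma_dist

def omega_cum_gamma(t, alpha, beta, gamma):
    """Cumulative number of vulnerabilities predicted by the Gamma VDM."""
    return gamma * gamma_dist.cdf(t, a=alpha, scale=beta)

t = np.arange(1, 121)   # 120 thirty-day intervals (about ten years)
print(omega_cum_gamma(t, alpha=2.0, beta=30.0, gamma=3000.0)[-3:])
```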
3.5.2 Weibull-based VDM

The Weibull-based VDM, derived from the Weibull distribution, belongs to the family of flexible-skewed distributions. This VDM was first introduced in 2007 [22]. The vulnerability discovery rate/intensity function ω for the Weibull-based VDM and its cumulative model Ω are presented in (4) and (5), respectively:

\[ \omega(t) = \frac{\alpha\gamma}{\beta} \left(\frac{t}{\beta}\right)^{\alpha-1} e^{-(t/\beta)^{\alpha}} \tag{4} \]

\[ \Omega(t) = \gamma \left\{ 1 - e^{-(t/\beta)^{\alpha}} \right\} \tag{5} \]

Like the Gamma-based VDM, the Weibull-based VDM has a continuous intensity function with three parameters: α (shape parameter), β (scale parameter), and γ, which represents the total number of vulnerabilities that would finally be discovered. This VDM can be symmetrical, with zero skewness, for α values around 3. For α < 3, this VDM is always right-skewed, while for α > 3, it is left-skewed. Like the Gamma-based VDM, this distribution is defined for t > 0.

3.5.3 AML VDM

The Alhazmi–Malaiya Logistic (AML) model belongs to the family of distributions with symmetrical intensity (rate) functions. This model was first introduced in 2005 [8] and is based upon the idea that as an operating system gains market share, the attention it receives increases; then, after experiencing a peak, the attention starts decreasing when a newer version is released. Overall, the AML model assumes the cumulative number of vulnerabilities is influenced by two factors: the share of the installed base (an increasing factor) and the number of remaining undiscovered vulnerabilities (a declining factor). The AML model has three parameters, including a constant C. Parameters A and B are empirical constants directly estimated from the dataset; B stands for the total number of vulnerabilities that would finally be discovered. This model is defined for time values t from negative infinity to positive infinity, and the parameters must be positive. The vulnerability discovery rate/intensity function ω for the AML VDM and its cumulative model Ω are presented in (6) and (7), respectively:

\[ \omega(t) = A\,\Omega\,(B - \Omega) \tag{6} \]

\[ \Omega(t) = \frac{B}{BCe^{-ABt} + 1} \tag{7} \]

3.5.4 Normal-based VDM

The Normal-based VDM belongs to the family of distributions with symmetrical intensity/probability density functions. This model presents a distribution with zero skewness and has three parameters: μ is a location parameter, s is a scale parameter, and γ is the total number of vulnerabilities that would eventually be discovered. The vulnerability discovery rate/intensity function ω for the Normal-based VDM and its cumulative model Ω are presented in (8) and (9), respectively:

\[ \omega(t) = \frac{\gamma\, e^{-(t-\mu)/s}}{s \left(1 + e^{-(t-\mu)/s}\right)^{2}} \tag{8} \]

\[ \Omega(t) = \frac{\gamma}{1 + e^{-(t-\mu)/s}} \tag{9} \]

The Normal-based VDM has lighter tails on both sides in comparison to the logistic distribution used for the AML model. For a dataset with fewer vulnerabilities discovered at the beginning and at the end of a discovery process, the Normal VDM might be a better fit than the AML model [25].
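For reference, the cumulative forms of equations (5), (7), and (9) can be written as plain functions, as in the following minimal sketch; the parameter values are illustrative only.

```python
# Minimal sketch: the cumulative forms of equations (5), (7), and (9) as
# plain functions; parameter values are illustrative only.
import numpy as np

def omega_weibull(t, alpha, beta, gamma):
    """Equation (5): Weibull-based VDM."""
    return gamma * (1.0 - np.exp(-(t / beta) ** alpha))

def omega_aml(t, A, B, C):
    """Equation (7): AML model; B is the eventual number of vulnerabilities."""
    return B / (B * C * np.exp(-A * B * t) + 1.0)

def omega_normal_based(t, mu, s, gamma):
    """Equation (9): cumulative form used for the Normal-based VDM."""
    return gamma / (1.0 + np.exp(-(t - mu) / s))

t = np.arange(1, 121, dtype=float)
print(omega_weibull(t, 1.5, 40.0, 3000.0)[-1],
      omega_aml(t, 1e-5, 3000.0, 0.01)[-1],
      omega_normal_based(t, 60.0, 15.0, 3000.0)[-1])
```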
3.5.5 Younis Folded VDM

The normal distribution is symmetric around its mean and is defined for a random variable that takes values from negative infinity to positive infinity. In some cases, a distribution is needed that has no negative values. Folded distributions are asymmetrical models obtained by folding the negative values onto the positive side of the distribution. Folded distributions have been found useful in industrial practice, such as measurements of flatness and straightness. In the Younis Folded (YF) VDM [54], vulnerability discovery starts at time t = 0, which corresponds to the release time of the software. In this model, t represents the calendar time, τ is a location parameter, σ is a scale parameter, and γ represents the number of vulnerabilities that will eventually be discovered. The second term in the cumulative equation (13) represents the part of the distribution folded onto the positive side, which shows the discovery process for the Folded VDM. In the equation, erf(·) is the error function. This distribution is defined for t ≥ 0.

\[ \omega(t) = \frac{\gamma}{\sqrt{2\pi}\,\sigma} \left\{ \exp\left(-\frac{(t-\tau)^2}{2\sigma^2}\right) + \exp\left(-\frac{(t+\tau)^2}{2\sigma^2}\right) \right\} \tag{12} \]

\[ \Omega(t) = \frac{\gamma}{2} \left\{ \operatorname{erf}\left(\frac{t-\tau}{\sqrt{2}\,\sigma}\right) + \operatorname{erf}\left(\frac{t+\tau}{\sqrt{2}\,\sigma}\right) \right\} \tag{13} \]

Compared to the AML model, the Folded VDM has a shorter (or missing) learning phase, which is what makes the underlying normal distribution asymmetric. This results in a higher discovery rate at the beginning, which may be especially applicable to cases where the vulnerability discovery plot is in the linear phase from the very beginning.

3.6 Non-S-shaped Vulnerability Discovery Models

In addition to the S-shaped models described in Section 3.5, we also considered several VDMs that are not S-shaped and cannot be characterized solely by their skewness (note: we used the classification presented in [24]). These models are the Power-law, Rescorla Quadratic (RQ), and Rescorla Exponential (RE) models.

3.6.1 Power-law Software Reliability Growth Models (SRGM-based)

Research has been conducted to find a link between the fault discovery process of a software and the discovery process of its vulnerabilities for modeling purposes [4]. When considering the fault detection process of a software, it is justifiable to conclude that software reliability growth models (SRGMs) and vulnerability discovery models (VDMs) are similar [1]. In such cases, the intensity/rate function can represent the detection rate of vulnerabilities.

When modeling the cumulative number of failures Ω(t) for software reliability evaluations, models derived from a nonhomogeneous Poisson process (NHPP) are often used. Allodi [55] showed that vulnerability exploitation may follow a Power-law distribution. However, such models have several assumptions. The main one is that the number of detected vulnerabilities follows a nonhomogeneous Poisson process. In addition, if we consider a software as a repairable system, its intensity function ω(t) = dE[Ω(t)]/dt is often, for simplicity, assumed to be a monotonic function of t. Therefore, in NHPP-based SRGMs or NHPP-based VDMs, the intensity function (the detection rate of software errors/vulnerabilities) is considered to be a monotonic function [56].

The equations associated with the Power-law model are presented in (10) and (11). This model is continuous over time and has two parameters: α (shape parameter) and β (scale parameter).

\[ \omega(t) = \frac{\alpha}{\beta}\left(\frac{t}{\beta}\right)^{\alpha-1} \tag{10} \]

\[ \Omega(t) = \beta^{-\alpha}\, t^{\alpha} \tag{11} \]

3.6.2 Rescorla Exponential (RE)

Rescorla's models are simple exponential and quadratic models that are commonly used and were first introduced by Rescorla to the vulnerability analysis area [7]. In equation (14), γ represents the number of vulnerabilities that will eventually be discovered, and λ represents the detection rate of software errors/vulnerabilities.

\[ \Omega(t) = \gamma\left(1 - e^{-\lambda t}\right) \tag{14} \]

3.6.3 Rescorla Quadratic (RQ)

Equation (15) is a simple quadratic model introduced by Rescorla [7]. Parameters A and B are empirical constants directly estimated from the dataset.

\[ \Omega(t) = \frac{At^2}{2} + Bt \tag{15} \]
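As an illustration of how any of these cumulative models can be fitted to data, the following is a minimal least-squares sketch for the Power-law model of equation (11), in the spirit of the regression approach of [24]; the observed series is synthetic.

```python
# Minimal sketch: fitting the Power-law cumulative model (equation (11)) to
# a cumulative vulnerability count series by least squares. The data here
# are synthetic.
import numpy as np
from scipy.optimize import curve_fit

def omega_powerlaw(t, alpha, beta):
    """Equation (11): Omega(t) = beta**(-alpha) * t**alpha."""
    return beta ** (-alpha) * t ** alpha

t = np.arange(1, 101, dtype=float)   # 30-day interval index
rng = np.random.default_rng(1)
observed = omega_powerlaw(t, 1.4, 0.8) + rng.normal(0.0, 5.0, t.size)

params, _ = curve_fit(omega_powerlaw, t, observed, p0=[1.0, 1.0],
                      bounds=([0.1, 0.01], [5.0, 10.0]))
print(f"alpha ~ {params[0]:.2f}, beta ~ {params[1]:.2f}")
```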
3.7 Summary

In this chapter, we introduced the datasets used in this thesis. We presented the models used for our analysis in two categories: S-shaped VDMs and non-S-shaped VDMs. In the next chapter, we will apply these models to the vulnerability datasets and compare their curve-fitting and prediction capabilities.

Chapter 4: Non-Cluster-based Vulnerability Assessment

4.1 Introduction

In this chapter, we will apply the VDMs introduced in Chapter 3 to the vulnerability datasets associated with the operating systems and web browsers also introduced in Chapter 3. Then, we will compare the curve-fitting and prediction capabilities of these models and investigate which models perform better in a given situation. Finally, we will present some guidelines for modeling vulnerability discovery data based upon common VDMs.

4.2 Motivation

Among the research conducted on vulnerability modeling, few studies have tried to provide a guideline about which model should be used in a given situation. In other words, assuming the vulnerability data is provided, the research questions are the following: Is there any feature in the vulnerability data that could be used for identifying the most appropriate models for that dataset? What models are more accurate for vulnerability discovery process modeling?

In this chapter, we make the following contributions:

- We compare the curve fitting and prediction capabilities of eight VDMs (i.e., one right-skewed distribution-based model, one flexible-skewed distribution-based model, two symmetric distribution-based models, one folded distribution-based model, one Power-law model, one exponential model, and one quadratic model) on two types of software (i.e., OSs and web browsers).
- We present some guidelines for modeling vulnerability discovery data based upon common VDMs.
- Based upon our findings from the estimation results, we show that the Gamma-based VDM was the most accurate model for the datasets, being the best in 62.5% of the cases.
- We also show that, based upon our findings from the prediction results, the Power-law model provides the most accurate predictions for the datasets, performing better than the other models in 62.5% of the cases.

4.3 Analysis

For each software, the variable we used in this research is the cumulative number of vulnerabilities reported in 30-day time intervals. In other words, we split the study period associated with a given software into intervals of 30 days and counted the total number of vulnerabilities detected in each interval. Then, we accumulated the total number of detected vulnerabilities over the intervals. For curve fitting, the eight models were fitted to the eight datasets (for the four OSs and the four web browsers) using a regression method described in [24]. To avoid overfitting, 10-fold cross validation was also conducted using Python's sklearn library [57]. The analysis of the prediction capability is done for 2016, 2017, and 2018. During the training period (before 2016), all the available data was used to estimate the model parameters. The estimated final values for each interval produced by the eight models were compared with the actual number of vulnerabilities to calculate the prediction accuracy.
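A minimal sketch of this interval construction follows; the publication dates are hypothetical.

```python
# Minimal sketch (hypothetical dates): turning a list of vulnerability
# publication dates into the cumulative count per 30-day interval that
# the models are fitted to.
import pandas as pd

dates = pd.to_datetime(pd.Series(
    ["2015-01-03", "2015-01-20", "2015-02-14", "2015-04-02", "2015-04-03"]))

t0 = dates.min()
interval = ((dates - t0).dt.days // 30).astype(int)   # 30-day interval index

counts = interval.value_counts().reindex(
    range(interval.max() + 1), fill_value=0).sort_index()
cumulative = counts.cumsum()
print(cumulative.to_list())   # [2, 3, 4, 5]
```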
4.3.1 Curve-fitting error indicators

We applied the Chi-square (χ²) goodness of fit test [24] to see how well each model fits the datasets. The χ² statistic is calculated using the following equation:

\[ \chi^2 = \sum_{i=1}^{N} \frac{(S_i - O_i)^2}{O_i} \tag{16} \]

where S_i and O_i are the simulated and real observed values at the i-th interval, respectively, and N is the number of observations. For the fit to be acceptable, the χ² critical value for the given alpha level and degrees of freedom should be greater than the χ² statistic. We selected an alpha level of 0.05. The null hypothesis is that the actual distribution is well described by the fitted model. Hence, if the p-value of the χ² test is below 0.05, the fit is considered unsatisfactory; a p-value closer to 1 indicates a better fit.

R² is another fitting statistic used in regression analysis [58]. An R² value close to 1 indicates a good fit. R² values are usually used in linear regression analysis [59] and might lead to some inaccuracy in the case of non-linear regression analysis. However, since this metric has been used in previous studies, we decided to also calculate it so that our results could be compared with those studies.

The root mean square error (RMSE) is often used in research to calculate fitting errors. However, Mentaschi et al. [60] showed that for some applications (e.g., high fluctuation of real data) lower values of RMSE are not always a reliable indicator of the accuracy of simulations. Hence, a corrected estimator HH was proposed by Hanna and Heinold [61]:

\[ HH = \sqrt{ \frac{\sum_{i=1}^{N} (S_i - O_i)^2}{\sum_{i=1}^{N} S_i O_i} } \tag{17} \]

where S_i is the i-th simulated value, O_i is the i-th observation, and N is the number of observations (the time blocks used for simulation). The closer HH is to zero, the more accurate the model.

4.3.2 Prediction error indicators

We calculated two normalized predictability measures, the average error (AE) and the average bias (AB) [25]. AE is a measure of how well a model predicts throughout the test phase, and AB indicates the general bias of the model, which assesses its tendency to overestimate or underestimate the number of discovered vulnerabilities. AE and AB are defined as:

\[ AE = \frac{1}{n} \sum_{t=1}^{n} \left| \frac{\Omega_t - \Omega}{\Omega} \right| \tag{18} \]

\[ AB = \frac{1}{n} \sum_{t=1}^{n} \frac{\Omega_t - \Omega}{\Omega} \tag{19} \]

where n is the total number of intervals (one per recorded detection date) over the prediction period, Ω is the actual number of total vulnerabilities, and Ω_t is the estimated number of total vulnerabilities at time t. AB can be positive (overestimation) or negative (underestimation), while AE is always positive.
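A minimal sketch of equations (17)-(19) as plain functions follows; the example inputs are made up.

```python
# Minimal sketch: the error indicators of equations (17)-(19) as plain
# functions; the example inputs are hypothetical.
import numpy as np

def hh(S, O):
    """Equation (17): the Hanna-Heinold corrected error estimator."""
    S, O = np.asarray(S, float), np.asarray(O, float)
    return np.sqrt(np.sum((S - O) ** 2) / np.sum(S * O))

def average_error(omega_t, omega):
    """Equation (18): mean absolute normalized error over the prediction period."""
    return np.mean(np.abs((np.asarray(omega_t, float) - omega) / omega))

def average_bias(omega_t, omega):
    """Equation (19): signed bias; negative values indicate underestimation."""
    return np.mean((np.asarray(omega_t, float) - omega) / omega)

predicted = [95, 102, 108]   # hypothetical model estimates per interval
actual = 100                 # actual total number of vulnerabilities
print(average_error(predicted, actual), average_bias(predicted, actual))
print(hh(predicted, [92, 101, 110]))
```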
4.4 Curve-Fitting Results

The eight models were fitted to the eight datasets (for the four OSs and the four web browsers) using a regression method described in [24]. To avoid overfitting, 10-fold cross validation was also conducted using Python's sklearn library [57]. In each case, the model that has the smallest value of HH is selected as the best fitting model and highlighted in green. If the HH values of two models were equal, we used the RMSE values to differentiate them: the model with the higher value of RMSE becomes the second best model and is highlighted in yellow. For models with equal HH and almost equal RMSE (difference ≤ 0.001), both models were highlighted in green.

4.4.1 Operating systems

The data and the fitted curves for the OSs are shown in Figure 3. Table 3 contains the χ² goodness of fit test p-values and the values of R², RMSE, and HH for the four OSs. For the OSs, the Power-law model is statistically sound (p-values greater than 0.05) only for Linux. All the other VDMs are each significant in two cases. When comparing HH values for Windows, IOS, and Linux, we observe that the Gamma-based and Weibull-based VDMs performed as well as the Power-law model. However, for Windows and IOS, they are not statistically significant based on the generated p-values.

Figure 3: Fitted models for the operating systems (Windows, Mac, IOS, and Linux). Each panel shows the data and the eight fitted models (Gamma, Weibull, AML, Normal, Power-law, RE, RQ, and YF).

For Windows, the YF VDM provides the best fit since it has the smallest HH and RMSE values as well as a p-value greater than 0.05. For Mac, the Gamma- and Weibull-based VDMs as well as the Power-law model and the YF VDM led to statistically sound fits. However, the Gamma-based VDM was selected as the best fit due to HH and RMSE values lower than those of the other models. For IOS, all the models except RQ are statistically sound. The models with smaller fitting errors (the AML and Normal-based VDMs) were selected as the best fits. For Linux, like Windows and IOS, all the models except RE and YF are statistically sound. However, the Gamma- and Weibull-based VDMs as well as the Power-law model have the smallest HH values and relatively equal RMSEs. Thus, they are the most accurate fits. In addition, from Figure 3 we found that for the considered OSs, the vulnerability discovery intensity function is still increasing. These software products appear to be in phase 2 (cf. Figure 2), and many vulnerabilities in these products are yet to be discovered.

Table 3: CURVE FITTING ACCURACY FOR OSS

             Windows                              Mac
             p-value  R^2    RMSE     HH          p-value  R^2    RMSE     HH
  Gamma      0.686    0.998  38.549   0.033       0.193    0.993  52.139   0.061
  Weibull    0.936    0.998  33.010   0.028       0.703    0.991  57.979   0.068
  AML        0.956    0.999  30.643   0.026       0.001    0.988  67.885   0.080
  Normal     0.956    0.999  30.704   0.026       0.001    0.987  68.033   0.080
  Power-law  0.466    0.994  59.757   0.051       0.058    0.984  76.679   0.090
  RE         0.026    0.983  102.871  0.088       0.000    0.968  108.229  0.128
  RQ         0.888    0.995  58.298   0.050       0.002    0.988  67.651   0.080
  YF         0.200    0.999  27.002   0.023       0.193    0.990  60.709   0.071

             IOS                                  Linux
             p-value  R^2    RMSE     HH          p-value  R^2    RMSE     HH
  Gamma      0.981    0.995  9.727    0.054       0.640    0.985  105.371  0.085
  Weibull    0.981    0.995  9.379    0.052       0.640    0.985  105.264  0.085
  AML        0.990    0.997  7.391    0.041       0.630    0.977  129.095  0.105
  Normal     0.990    0.997  7.403    0.041       0.630    0.977  129.338  0.105
  Power-law  0.642    0.995  9.357    0.052       0.640    0.985  105.234  0.085
  RE         0.150    0.998  9.044    0.053       0.006    0.983  110.346  0.089
  RQ         0.000    0.973  22.993   0.128       0.222    0.985  105.901  0.086
  YF         0.370    0.998  6.334    0.035       0.006    0.982  114.012  0.092

4.4.2 Web browsers

The data and the fitted curves for the web browsers are shown in Figure 4. Table 4 contains the χ² goodness of fit test p-values and the values of R², RMSE, and HH for the four web browsers. For IE, all the models are statistically sound and YF has the lowest HH value; thus, it was selected as the best fit. For Safari, although all the models except RE are statistically sound, the Gamma-based VDM provided a better fit than the other models based on the HH values. For Firefox, the Gamma-based VDM is the best model since it has a p-value greater than 0.05 and smaller fitting errors (HH and RMSE) than the other models. For Chrome, the Gamma-based VDM provided the best fit due to having the smallest HH and RMSE values.
Table 4: CURVE FITTING ACCURACY FOR WEB BROWSERS

             IE                                   Safari
             p-value  R^2    RMSE     HH          p-value  R^2    RMSE     HH
  Gamma      0.624    0.977  50.137   0.098       0.245    0.993  17.915   0.056
  Weibull    0.624    0.978  50.051   0.098       0.739    0.993  18.945   0.059
  AML        0.403    0.975  53.035   0.104       0.050    0.991  21.297   0.066
  Normal     0.403    0.975  53.151   0.104       0.050    0.991  21.365   0.066
  Power-law  0.624    0.978  50.041   0.098       0.378    0.984  27.731   0.086
  RE         0.403    0.983  53.307   0.099       0.012    0.969  38.621   0.121
  RQ         0.624    0.978  59.149   0.099       0.549    0.984  27.413   0.085
  YF         0.285    0.983  44.100   0.086       0.986    0.992  19.629   0.061

             Firefox                              Chrome
             p-value  R^2    RMSE     HH          p-value  R^2    RMSE     HH
  Gamma      0.378    0.998  18.683   0.028       0.368    0.998  20.094   0.036
  Weibull    0.307    0.998  18.826   0.029       0.368    0.994  32.551   0.059
  AML        0.250    0.993  34.519   0.053       0.690    0.995  29.213   0.053
  Normal     0.250    0.993  34.630   0.053       0.690    0.995  29.415   0.053
  Power-law  0.307    0.998  20.032   0.031       0.240    0.962  83.910   0.153
  RE         0.115    0.992  36.779   0.056       0.000    0.937  107.974  0.199
  RQ         0.193    0.996  24.914   0.038       0.000    0.973  70.106   0.127
  YF         0.150    0.996  24.543   0.037       0.400    0.996  25.641   0.046

Figure 4: Fitted models for the web browsers (IE, Safari, Firefox, and Chrome). Each panel shows the data and the eight fitted models (Gamma, Weibull, AML, Normal, Power-law, RE, RQ, and YF).

In Figure 4, the vulnerability data associated with IE and Chrome show a saturation phase: their discovery intensity functions have a decreasing trend at the end (i.e., the rate of discovery of new vulnerabilities is predicted to decrease). Thus, these products appear to be in the saturation phase (cf. Figure 2). For Safari and Firefox, the discovery intensity functions are increasing and constant, respectively.

4.4.3 Summary of estimation results

Overall, in terms of curve fitting, considering the OSs, the Gamma and YF VDMs were each the best model in 50% of the cases, while the Weibull-based VDM and the Power-law model were each the best model in 25% of the cases. Considering the web browsers, the Gamma-based VDM provided the best fits in 75% of the cases, whereas the YF VDM was the best model in 25% of the cases; the other VDMs were the best in none of the cases. Considering the OSs and the web browsers together (8 cases), the Gamma, Weibull, AML, Normal, Power-law, RE, RQ, and YF models performed best in 5 (62.5%), 1 (12.5%), 0 (0%), 0 (0%), 1 (12.5%), 0 (0%), 0 (0%), and 3 (37.5%) cases, respectively. Therefore, based upon our findings from the estimation results, the Gamma-based VDM was the most accurate model for the datasets we analyzed.

4.5 Prediction Results

The analysis of the prediction capability is initiated after two-thirds of the time period from the beginning of the vulnerability discovery process. During the training period, all the available data was used to estimate the model parameters. The estimated final values for each 30-day interval produced by the eight models were compared with the actual number of vulnerabilities to calculate the prediction accuracy. The model that has the smallest value of AE and a p-value ≥ 0.05 was selected as having the best prediction capability and is highlighted in green. Regarding p-values, we used * to mark the models with p < 0.05. If the AE values of two models were equal, we selected the best model based upon the AB and HH values: the model with a higher bias or HH was selected as the second best model and is highlighted in yellow.

4.5.1 Operating systems

The normalized error values ((Ω_t − Ω)/Ω) for the OSs are shown in Figure 5. Table 5 presents the values of AE, AB, R², and HH for the four OSs in our study.
Comparing prediction capabilities for Windows, we found that the Power-law model has the smallest AE, AB, R², and HH values.

Table 5: PREDICTION ACCURACY FOR OSS

             Windows                              Mac
             AE      AB      R^2       HH         AE      AB      R^2       HH
  Gamma      0.063   -0.061  271.445   0.095      0.218   -0.218  595.482   0.263
  Weibull    0.091   -0.091  367.988   0.131      0.233   -0.233  640.863   0.287
  AML        0.138   -0.138  511.292   0.187      0.278*  -0.278  761.332   0.351
  Normal     0.138   -0.138  511.292   0.187      0.278*  -0.278  761.328   0.351
  Power-law  0.037   0.025   119.646   0.040      0.074   -0.074  198.921   0.080
  RE         0.106*  0.106   314.463   0.100      0.024*  0.017   95.639    0.037
  RQ         0.039   0.030   125.961   0.042      0.082*  -0.082  221.121   0.090
  YF         0.114   -0.114  438.161   0.158      0.256   -0.256  703.498   0.320

             IOS                                  Linux
             AE      AB      R^2       HH         AE      AB      R^2       HH
  Gamma      0.018   0.000   15.316    0.025      0.268   -0.268  1382.822  0.354
  Weibull    0.019   0.005   17.105    0.027      0.267   -0.267  1378.715  0.352
  AML        0.076   0.076   57.333    0.088      0.272   -0.272  1423.009  0.366
  Normal     0.076   0.076   57.332    0.088      0.272   -0.272  1423.002  0.366
  Power-law  0.019   0.006   17.366    0.028      0.267   -0.267  1377.862  0.352
  RE         0.131   0.131   98.947    0.148      0.190*  -0.190  987.650   0.239
  RQ         0.154*  -0.154  98.230    0.172      0.278   -0.278  1431.530  0.369
  YF         0.092   0.092   70.902    0.108      0.240*  -0.240  1248.693  0.313

For Mac, the Power-law model has the smallest values of AE, AB, and HH. For IOS, the Gamma-based VDM has the smallest value of AE. For Linux, the Weibull-based VDM and the Power-law model have the best results. Note that in Table 5, negative values of AB indicate that the model may underestimate the total number of discovered vulnerabilities.

Figure 5: Normalized prediction error values for the models (OSs).

4.5.2 Web browsers

The normalized error values ((Ω_t − Ω)/Ω) associated with the web browsers are shown in Figure 6. Table 6 presents the values of AE, AB, R², and HH for the four web browsers in our study. For IE, the YF VDM had the smallest error values. For Safari, the Power-law model had the smallest prediction error values. For Firefox, the Weibull-based VDM provided the best prediction capabilities. For Chrome, the Power-law model had the smallest error values.

Table 6: PREDICTION ACCURACY FOR WEB BROWSERS

             IE                                   Safari
             AE      AB      R^2       HH         AE      AB      R^2       HH
  Gamma      0.234   -0.234  402.761   0.273      0.156   -0.156  159.467   0.201
  Weibull    0.233   -0.233  400.858   0.272      0.187   -0.187  190.325   0.245
  AML        0.157   -0.157  270.708   0.175      0.231   -0.231  228.863   0.304
  Normal     0.157   -0.157  270.706   0.175      0.231   -0.231  228.863   0.304
  Power-law  0.233   -0.233  400.719   0.271      0.030   0.026   32.144    0.037
  RE         0.149   -0.149  255.874   0.164      0.133*  0.133   130.867   0.141
  RQ         0.232   -0.232  398.766   0.270      0.041   0.040   41.458    0.047
  YF         0.141   -0.141  242.287   0.155      0.211   -0.211  212.086   0.278

             Firefox                              Chrome
             AE      AB      R^2       HH         AE      AB      R^2       HH
  Gamma      0.051   0.035   103.041   0.066      0.281   -0.281  527.771   0.367
  Weibull    0.049   0.031   98.984    0.064      0.317   -0.317  590.544   0.422
  AML        0.081   -0.081  162.703   0.112      0.307   -0.307  571.317   0.405
  Normal     0.081   -0.081  162.703   0.112      0.307   -0.307  571.314   0.405
  Power-law  0.069   0.067   140.551   0.089      0.167   0.167   355.845   0.191
  RE         0.161   0.161   287.176   0.174      0.364*  0.364   776.614   0.383
  RQ         0.096   0.096   180.913   0.113      0.077*  0.077   181.493   0.102
  YF         0.051   -0.032  103.839   0.069      0.304   -0.304  567.935   0.402

Figure 6: Normalized prediction error values for the models (web browsers).

4.5.3 Summary of prediction results

Overall, in terms of prediction, considering the OSs, the Power-law model performed better in 3 (75%) cases. The Gamma- and Weibull-based VDMs provided better prediction results in 1 (25%) and 1 (25%) case, respectively. The Normal-based, AML, RE, RQ, and YF VDMs were selected in none of the cases.
Considering the web browsers, the Power-law model led to better predictions in 2 (50%) cases, while the Weibull-based VDM and the YF VDM were each selected as the best predictor in 1 (25%) case. Considering the OSs and the web browsers together (8 cases), the Gamma, Weibull, AML, Normal, Power-law, RE, RQ, and YF VDMs provided satisfactory prediction results in 1 (12.5%), 2 (25%), 0 (0%), 0 (0%), 5 (62.5%), 0 (0%), 0 (0%), and 1 (12.5%) cases, respectively. Thus, based upon our findings from the prediction results, the Power-law model yielded the most accurate predictions for the datasets we analyzed. Recall that all the conclusions we draw from this research and their validity are limited by our database uncertainties.

4.6 Discussion

Based on our results, we found that a model's ability to provide a good fit does not necessarily guarantee superior prediction capabilities. To the best of our knowledge, few studies have tried to provide guidance about which model should be used in a given situation. Joh et al. [25] investigated the relationship between the performance of five S-shaped VDMs (i.e., AML, Weibull, Gamma, Normal, and Beta) and the skewness in vulnerability datasets for eight software products. Comparing our findings with those from Joh et al. [25], in terms of prediction, we did not find any case out of our eight asymmetrical datasets where the AML VDM performed better than the right-skewed distribution models, whereas Joh et al. [25] stated that the AML model performs better than some right-skewed distribution models in terms of prediction when the vulnerability discovery datasets are asymmetrical.

A model's tendency to overestimate or underestimate the results is another factor that plays an important role in the model selection procedure. We evaluated the bias values (AB) and explained that the final decision is up to the researcher to choose the best model based upon his/her priorities. However, from a security point of view, it is better to choose a model that provides more conservative prediction results, if it has justifiable error values. In the current study, among the models that were selected as the best predictors, seven models provided overestimated results; the other selected models underestimated the number of vulnerabilities.

4.7 Limitations

There are several limitations to our work that prevent us from making more general conclusions. One limitation concerns the uncertainty of the databases we used: vulnerability databases might have some uncertainty regarding variables associated with the reported vulnerabilities, such as published dates.

Another limitation is associated with using SRMs (the Power-law model) as VDMs. Software reliability models usually assume that the time between failures represents the total usage time of the product. What we are using is calendar time, which may not be a good proxy for usage. One important difference with security studies is the difficulty in estimating the "attacker effort" (the total amount of time that an attacker spends finding a vulnerability), which is something that is not needed in the context of reliability (we assume the users accidentally encounter faults that lead to failures, hence usage time is a good enough proxy for time between failures). A useful discussion of this is given in [30].

We have used all the vulnerabilities for all the versions of the products in our study.
While a number of studies utilize vulnerability data associated with a separate version of a software (e.g., Windows 7) on which to apply VDMs, there are papers that consider all versions of a software together [25], [62]. The first group assumes that each version of a given software is an independent and well-characterized item, yet identifying the sources of dependence in vulnerability data is not a simple task.

4.8 Summary

In this chapter, we applied the models introduced in Chapter 3 to the vulnerability datasets associated with the operating systems and web browsers also discussed in Chapter 3. Then, we compared the curve-fitting and prediction capabilities of these models and investigated which models perform better in a given situation. Finally, we presented some guidelines for using eight common VDMs to model vulnerability discovery data based on a given dataset. In the next chapter, we will test whether a clustering approach improves the accuracy of the curve-fitting/prediction results.

Chapter 5: Clustering

5.1 Introduction

In this chapter, we will test whether a clustering approach improves the accuracy of the curve-fitting/prediction results. We will focus on how clusters were created for the different vulnerability datasets associated with the operating systems and web browsers discussed in Chapter 3, and apply the five S-shaped VDMs and three non-S-shaped VDMs also introduced in Chapter 3 to them. Then, we will compare the curve-fitting and prediction capabilities of these models, investigate which models perform better in a given situation, and compare them with the results obtained without clustering (from Chapter 4).

5.2 Motivation

Several studies have applied SRMs/VDMs to estimate times between public reports of vulnerabilities [3], [6]–[8], [17], [18]. In all the studies we are aware of, curve fitting and/or prediction capabilities were estimated using all the vulnerabilities together. We postulate that such analysis may miss some trends that apply to separate categories of vulnerabilities, rather than to all the vulnerabilities together. Moreover, SRMs assume vulnerability detection to be an independent process. However, this process might not be independent due, for example, to the discovery of a new type of vulnerability that might prompt attackers to look for similar vulnerabilities [5]. This assumption may lead to sub-optimal predictions of the next reporting date of a vulnerability, or of the total number of new vulnerabilities reported in the next time interval. One way to mitigate these issues is to split vulnerabilities into separate clusters and ensure that the clusters are independent.

In this chapter, we make the following contributions:

- We present an approach that uses existing clustering techniques to group vulnerabilities into distinct clusters, leveraging the textual information reported in these vulnerabilities as a basis for constructing the clusters.
- Our approach uses existing VDMs to make predictions on the number of new vulnerabilities that will be discovered in a given time period for each cluster for a given OS/web browser.
- Our approach also superposes the VDMs used for the individual clusters into a single model for predicting the number of vulnerabilities that will be discovered in a given time period for a given OS/web browser.
- We show that, based upon our findings from the prediction results, the clustering approach was more accurate, performing better than the approach without clustering in 58% of the cases.
5.3 Data Processing

For each software, we included all the vulnerabilities reported for any of its versions. For instance, all the vulnerabilities reported for mac_os, mac_os_server, mac_os_x, and mac_os_x_server were put together to create a vulnerability database for Mac. To prepare the data for the clustering phase, we used the text information within the vulnerability reports to label the vulnerabilities. The keywords for labelling (e.g., denial, injection, buffer, execute) were extracted from these reports. Tables 43 and 50 in Appendix A show the total number of vulnerabilities as well as the number of labelled and non-labelled (vulnerabilities without any associated text information in the database) vulnerabilities for the datasets (OSs and web browsers). For the labelled vulnerabilities, we indicate the number and proportion of vulnerabilities associated with a specific keyword. Note that vulnerabilities can be labelled with more than one keyword.

For cluster analysis, we need to ensure that the features (keywords) are not correlated. Therefore, we checked the Pearson correlation coefficient for every pair of keywords per dataset. When we found a statistically significant correlation, we merged the correlated keywords under a title that included both terms. For instance, due to the high correlation of 0.99 (p-value < 0.001, H0: ρ = 0) between "Execute" and "Code" for all the datasets, these terms were treated as "Execute Code". The same applied to the keywords "SQL" and "Injection". No other significant correlation was observed. Figure 7 shows the diagram of our clustering approach.

Figure 7: Diagram of the presented clustering approach.

5.4 Clustering Method

We used the HPCLUS (High Performance Clustering) procedure in SAS 9.4 with the k-means and k-modes algorithms for clustering nominal input variables. This procedure uses the least-squares method in k-means to compute cluster centroids. Each iteration reduces the criterion (e.g., the least-squares criterion for Euclidean distance) until convergence is achieved or the maximum iteration number is reached [63]. Additionally, we set our method to cluster the data based upon the principal component analysis (PCA) scores derived from the linear combinations of the binary attributes of each dataset. PCA reduces a set of possibly correlated features to independent linear combinations of them [40].

To estimate the best number of clusters, the aligned box criterion (ABC) method was used. Among other existing methods, the cubic clustering criterion (CCC) is a common metric usually used in clustering applications to find the most suitable number of clusters [64]. In addition, Tibshirani et al. [65] introduced a gap statistics method, which leverages Monte Carlo simulation for finding the best number of clusters in a database. However, it has been found that the ABC method improves on the CCC and gap statistics methods by leveraging a high-performance, machine-learning-based analysis structure [63]. Within-cluster dispersion is used as an error measure (also called a 'Gap') by the ABC method [65]. In order to find the best number of clusters, we applied the ABC method and compared the calculated Gap values over a range of possible k values. The best number of clusters occurs at the maximum peak value of Gap(k) [63].
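The clustering itself was performed in SAS 9.4; the following is only an illustrative scikit-learn analogue of the same idea (k-means over PCA scores of binary keyword attributes, compared over a range of k), with a hypothetical keyword matrix.

```python
# Minimal sketch (the dissertation used SAS 9.4 HPCLUS; this is an
# illustrative scikit-learn analogue): k-means over PCA scores derived
# from binary keyword attributes, compared over a range of k.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Rows = vulnerabilities, columns = binary keyword labels
# (e.g., denial, overflow, execute code); values are hypothetical.
X = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 1],
              [0, 0, 1], [0, 0, 1], [1, 1, 0], [0, 1, 1]], dtype=float)

scores = PCA(n_components=2).fit_transform(X)   # cluster on the PCA scores

for k in range(2, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(scores)
    # inertia_ is the within-cluster dispersion that criteria such as the
    # gap statistic (and the ABC method) evaluate when choosing k.
    print(k, round(km.inertia_, 3))
```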
After clustering, the most frequent keywords were selected to name the clusters, with respect to the keyword weight information provided in Appendix A. We assumed that the keywords which covered at least 60% of the vulnerabilities in each cluster can be good representatives of their respective clusters. If none of the keywords reached the weight threshold of 0.6 in a cluster associated with a given software, the keyword with the greatest weight was selected as the cluster's label.

5.4.1 Operating systems

We obtained 6, 6, 7, and 7 clusters for Windows, Mac, IOS, and Linux, respectively. The lists of keywords associated with each of the six/seven clusters for Windows, Mac, IOS, and Linux are provided in Appendix A. Since none of the keywords reaches the weight threshold of 0.6 in the fifth cluster associated with Mac, the keyword with the greatest weight (Execute Code) was selected as the cluster's label. Table 49 shows the cluster summaries for the OSs. All the OSs have one cluster with a similar name. There are also similarly named clusters within some OSs. However, after analyzing their linear correlation, we did not find any significant relationship based on the Pearson correlation test.

5.4.2 Web browsers

Following the same clustering approach we explained for the OSs, we obtained the following numbers of clusters: Internet Explorer (5), Safari (3), Firefox (5), and Chrome (5). More details about the clusters and the frequency of the keywords associated with them are provided in Appendix A. The cluster summaries for the browsers are shown in Table 56.

5.5 Analysis

In repairable systems with only one type of failure, the intensity function ω(t) = dE[Ω(t)]/dt is often assumed to be a monotonic function of t. Similarly, for most SRMs and VDMs, the intensity function (the detection rate of software errors/vulnerabilities) is considered to be a monotonic function [56]. Let us expand the discussion to a software for which there exists more than one type of error. When each type of error independently causes the software's normal function to be compromised, a superposition model represents the software failures.

Let us assume that we are dealing with vulnerabilities classified into independent clusters. Considering a given model (NHPP Power-law or a distribution-based VDM), let Ω_j(t) denote the mean cumulative number of vulnerabilities from the j-th cluster in (0, t], with intensity function ω_j(t | α_j, β_j), where the functional form of ω_j(t | α_j, β_j) is given and the values of the parameters α_j, β_j are unknown. It is assumed that the numbers of vulnerabilities from the individual clusters Ω_j(t), j = 1, 2, …, J, are independent. The process Ω(t) = Ω_1(t) + … + Ω_J(t), which counts the total number of vulnerabilities in the interval (0, t] for the superposition model, is also a same-type model (NHPP Power-law/distribution-based model) with intensity function ω(t | α, β) = ω_1(t | α_1, β_1) + … + ω_J(t | α_J, β_J), where α = {α_1, …, α_J} and β = {β_1, …, β_J}. Since the superposition model keeps its type (all intensity functions are of the same type), the associated superposition model is applicable [56]. For instance, considering the NHPP Power-law model, the equations become:

\[ \omega_j(t \mid \alpha_j, \beta_j) = \frac{\alpha_j}{\beta_j}\left(\frac{t}{\beta_j}\right)^{\alpha_j - 1} = \frac{\alpha_j\, t^{\alpha_j - 1}}{\beta_j^{\alpha_j}} \tag{20} \]

\[ \Omega(t) = \int_{0}^{t} \sum_{j=1}^{J} \omega_j(u \mid \alpha_j, \beta_j)\, du, \qquad \alpha_j > 0,\ \beta_j > 0 \tag{21} \]
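A minimal sketch of this superposition follows; the per-cluster (α_j, β_j) values are illustrative only, and the closed form of equation (21) uses the fact that each per-cluster intensity integrates to (t/β_j)^{α_j}.

```python
# Minimal sketch: the superposed Power-law model of equations (20)-(21),
# summing per-cluster intensities, each with its own (alpha_j, beta_j).
# The cluster parameters are hypothetical.
import numpy as np

def omega_j(t, alpha, beta):
    """Equation (20): per-cluster Power-law intensity."""
    return alpha * t ** (alpha - 1) / beta ** alpha

def omega_superposed(t, params):
    """Superposed intensity: the sum of the per-cluster intensities."""
    return sum(omega_j(t, a, b) for a, b in params)

def cum_superposed(t, params):
    """Equation (21) in closed form: each cluster integrates to (t/beta)**alpha."""
    return sum((t / b) ** a for a, b in params)

t = np.arange(1, 121, dtype=float)
clusters = [(1.2, 5.0), (0.9, 8.0), (1.5, 12.0)]   # hypothetical (alpha, beta)
print(round(cum_superposed(t, clusters)[-1], 1))
```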
In this chapter, we investigate the assessment results when relaxing the monotonicity assumption of the intensity function that is prevalent in SRMs and VDMs. Selecting a given VDM/SRM, we considered two approaches for each model. The first approach uses the non-clustered data (including only the labelled vulnerabilities) (cf. Chapter 4). The second approach is the superposition of the same model fitted to the clustered data (only the labelled vulnerabilities can be used to create the clusters), which relaxes the monotonicity assumption of the intensity function.

For each software, the analysis was done in two steps. First, we used the training data (note: the variable we used is the total number of vulnerabilities detected in 30-day time intervals) to find the model parameters from the process of fitting the models to the data (clustered and non-clustered). In other words, similar to what we described in Chapter 4, for the vulnerability data in each cluster, we divided the time axis into 30-day intervals (t = 0 is associated with the vulnerability with the earliest published date), and counted the cumulative frequency of vulnerabilities detected in each interval. The non-homogeneity of the clusters was also validated by looking at Laplace trend test results provided by Minitab 16, to check whether there were meaningful trends in the clusters.
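The trend tests were run in Minitab; the following is only an illustrative Python version of the Laplace trend test under the usual formulation, with hypothetical report times.

```python
# Minimal sketch (the dissertation used Minitab 16; this is an illustrative
# Python version of the Laplace trend test): a statistic far from 0 suggests
# a trending, non-homogeneous process. The report times are hypothetical.
import math

def laplace_trend(times, T):
    """Laplace statistic for event times in (0, T]; ~N(0, 1) under no trend."""
    n = len(times)
    return (sum(times) / n - T / 2) / (T / math.sqrt(12 * n))

report_days = [30, 95, 160, 210, 260, 300, 330, 350]   # days since release
print(round(laplace_trend(report_days, T=360), 2))     # > 0: increasing rate
```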
5.6 Curve-Fitting Results
In this section we provide the results regarding the curve-fitting capabilities of the models (comparing the results between clustered and non-clustered data, and comparing the results generated from clustering).

5.6.1 Operating systems
Tables 7-10 contain the Chi-square goodness-of-fit test p-values for the clustering-based MCF and the MCF without clustering, along with the values of R², RMSE, and HH, for the vulnerabilities of the four operating systems in our study. For Windows and IOS, the MCF without clustering yielded more accurate results in all the cases with p-values > 0.05, because of smaller RMSE and HH values. For Mac, in only one case did the MCF with clustering perform better than the MCF without clustering. For Linux, the MCF without clustering yielded more accurate results in two cases. Besides, among the models using non-clustered and clustered data, in five cases neither of the MCFs was statistically sound; we cannot compare these cases and refer to them as "invalid" cases. In addition, among the models with clustering, for Windows, the Gamma and Weibull-based VDMs provide the best fits since they have smaller HH and RMSE values. For Mac and IOS, the Gamma-based VDM leads to HH and RMSE values lower than the other VDMs. For Linux, the AML and Normal-based VDMs led to the smallest and roughly equal HH values and provide the most accurate fits.

Table 7: CURVE FITTING ACCURACY FOR WINDOWS
(each row: p-value, R-sq, RMSE, HH with clustering | without clustering)
Gamma: 0.233, 0.983, 63.550, 0.134 | 0.686, 0.998, 38.549, 0.033
Weibull: 0.233, 0.981, 63.996, 0.135 | 0.936, 0.998, 33.010, 0.028
AML: 0.074, 0.982, 64.949, 0.137 | 0.956, 0.999, 30.643, 0.026
Normal: 0.074, 0.982, 64.845, 0.137 | 0.956, 0.999, 30.704, 0.026
Power-law: 0.098, 0.990, 101.389, 0.202 | 0.466, 0.994, 59.757, 0.051
RE: 0.000, 0.980, 101.425, 0.220 | 0.026, 0.983, 102.871, 0.088
RQ: 0.000, 0.980, 102.100, 0.278 | 0.888, 0.995, 58.298, 0.050
YF: 0.074, 0.982, 64.845, 0.137 | 0.200, 0.999, 27.002, 0.023

Table 8: CURVE FITTING ACCURACY FOR MAC
(each row: p-value, R-sq, RMSE, HH with clustering | without clustering)
Gamma: 0.058, 0.985, 118.583, 0.236 | 0.193, 0.993, 52.139, 0.061
Weibull: 0.058, 0.984, 121.376, 0.242 | 0.703, 0.991, 57.979, 0.068
AML: 0.000, 0.986, 121.620, 0.244 | 0.001, 0.988, 67.885, 0.080
Normal: 0.000, 0.986, 120.570, 0.241 | 0.001, 0.987, 68.033, 0.080
Power-law: 0.072, 0.987, 131.265, 0.247 | 0.058, 0.984, 76.679, 0.090
RE: 0.059, 0.986, 129.168, 0.245 | 0.000, 0.968, 108.229, 0.128
RQ: 0.000, 0.890, 150.230, 0.360 | 0.002, 0.988, 67.651, 0.080
YF: 0.114, 0.986, 121.621, 0.244 | 0.193, 0.990, 60.709, 0.071

Table 9: CURVE FITTING ACCURACY FOR IOS
(each row: p-value, R-sq, RMSE, HH with clustering | without clustering)
Gamma: 0.170, 0.992, 399.504, 1.130 | 0.981, 0.995, 9.727, 0.054
Weibull: 0.175, 0.993, 404.167, 1.138 | 0.981, 0.995, 9.379, 0.052
AML: 0.076, 0.990, 450.814, 1.224 | 0.990, 0.997, 7.391, 0.041
Normal: 0.076, 0.991, 458.197, 1.237 | 0.990, 0.997, 7.403, 0.041
Power-law: 0.055, 0.990, 405.189, 1.140 | 0.642, 0.995, 9.357, 0.052
RE: 0.000, 0.987, 410.026, 1.149 | 0.150, 0.998, 9.044, 0.053
RQ: 0.000, 0.987, 450.159, 1.219 | 0.000, 0.973, 22.993, 0.128
YF: 0.098, 0.991, 408.237, 1.140 | 0.370, 0.998, 6.334, 0.035

Table 10: CURVE FITTING ACCURACY FOR LINUX
(each row: p-value, R-sq, RMSE, HH with clustering | without clustering)
Gamma: 0.371, 0.994, 166.381, 0.167 | 0.640, 0.985, 105.371, 0.085
Weibull: 0.371, 0.991, 129.92, 0.134 | 0.640, 0.985, 105.264, 0.085
AML: 0.560, 0.990, 104.775, 0.110 | 0.630, 0.977, 129.095, 0.105
Normal: 0.560, 0.994, 105.670, 0.110 | 0.630, 0.977, 129.338, 0.105
Power-law: 0.392, 0.990, 128.021, 0.148 | 0.640, 0.985, 105.234, 0.085
RE: 0.075, 0.992, 145.390, 0.155 | 0.006, 0.983, 110.346, 0.089
RQ: 0.171, 0.992, 147.008, 0.156 | 0.222, 0.985, 105.901, 0.086
YF: 0.151, 0.994, 128.620, 0.149 | 0.006, 0.982, 114.012, 0.092

5.6.2 Web browsers
Tables 11-14 contain the Chi-square goodness-of-fit test p-values for the clustering-based MCF and the MCF without clustering, along with the values of R², RMSE, and HH, for the vulnerabilities of the four web browsers in our study. For all the cases with p-values > 0.05, the MCF without clustering led to more accurate results than the clustering-based MCFs, because of smaller RMSE and HH values as well as statistically sound p-values. Besides, among the models using non-clustered and clustered data, in three cases neither of the MCFs was statistically sound. In addition, among the models with clustering, for IE, the Power-law model led to the smallest values of HH and RMSE. For Safari, the AML and Normal-based VDMs provided better fits than the other models. For Firefox, the Gamma and Weibull-based VDMs are the best models since they have smaller fitting errors (HH and RMSE) than the other models. For Chrome, the Gamma-based VDM provided the best fit due to smaller HH and RMSE values.
Table 11: CURVE FITTING ACCURACY FOR IE
(each row: p-value, R-sq, RMSE, HH with clustering | without clustering)
Gamma: 0.169, 0.997, 65.088, 0.342 | 0.624, 0.977, 50.137, 0.098
Weibull: 0.469, 0.997, 64.970, 0.341 | 0.624, 0.978, 50.051, 0.098
AML: 0.012, 0.997, 69.876, 0.372 | 0.403, 0.975, 53.035, 0.104
Normal: 0.012, 0.997, 69.882, 0.372 | 0.403, 0.975, 53.151, 0.104
Power-law: 0.162, 0.973, 56.496, 0.285 | 0.624, 0.978, 50.041, 0.098
RE: 0.067, 0.997, 69.872, 0.368 | 0.403, 0.983, 53.307, 0.099
RQ: 0.000, 0.990, 62.805, 0.285 | 0.624, 0.978, 59.149, 0.099
YF: 0.529, 0.997, 65.079, 0.341 | 0.285, 0.983, 44.100, 0.086

Table 12: CURVE FITTING ACCURACY FOR SAFARI
(each row: p-value, R-sq, RMSE, HH with clustering | without clustering)
Gamma: 0.057, 0.884, 125.658, 0.878 | 0.245, 0.993, 17.915, 0.056
Weibull: 0.057, 0.880, 125.476, 0.878 | 0.739, 0.993, 18.945, 0.059
AML: 0.080, 0.883, 98.623, 0.763 | 0.050, 0.991, 21.297, 0.066
Normal: 0.080, 0.810, 96.776, 0.769 | 0.050, 0.991, 21.365, 0.066
Power-law: 0.080, 0.92, 129.947, 0.887 | 0.378, 0.984, 27.731, 0.086
RE: 0.052, 0.91, 125.250, 0.884 | 0.012, 0.969, 38.621, 0.121
RQ: 0.039, 0.95, 130.402, 0.890 | 0.549, 0.984, 27.413, 0.085
YF: 0.000, 0.880, 120.714, 0.860 | 0.986, 0.992, 19.629, 0.061

Table 13: CURVE FITTING ACCURACY FOR FIREFOX
(each row: p-value, R-sq, RMSE, HH with clustering | without clustering)
Gamma: 0.294, 0.998, 22.944, 0.056 | 0.378, 0.998, 18.683, 0.028
Weibull: 0.294, 0.998, 23.704, 0.058 | 0.307, 0.998, 18.826, 0.029
AML: 0.115, 0.996, 32.155, 0.080 | 0.250, 0.993, 34.519, 0.053
Normal: 0.115, 0.996, 32.185, 0.080 | 0.250, 0.993, 34.630, 0.053
Power-law: 0.053, 0.990, 31.791, 0.078 | 0.307, 0.998, 20.032, 0.031
RE: 0.054, 0.998, 31.850, 0.079 | 0.115, 0.992, 36.779, 0.056
RQ: 0.000, 0.996, 45.291, 0.155 | 0.193, 0.996, 24.914, 0.038
YF: 0.113, 0.990, 32.001, 0.080 | 0.150, 0.996, 24.543, 0.037

Table 14: CURVE FITTING ACCURACY FOR CHROME
(each row: p-value, R-sq, RMSE, HH with clustering | without clustering)
Gamma: 0.137, 0.998, 57.890, 0.123 | 0.368, 0.998, 20.094, 0.036
Weibull: 0.137, 0.998, 60.450, 0.128 | 0.368, 0.994, 32.551, 0.059
AML: 0.385, 0.997, 62.917, 0.134 | 0.690, 0.995, 29.213, 0.053
Normal: 0.385, 0.997, 62.913, 0.134 | 0.690, 0.995, 29.415, 0.053
Power-law: 0.106, 0.978, 82.021, 0.174 | 0.240, 0.962, 83.910, 0.153
RE: 0.000, 0.998, 115.135, 0.291 | 0.000, 0.937, 107.974, 0.199
RQ: 0.000, 0.996, 81.852, 0.170 | 0.000, 0.973, 70.106, 0.127
YF: 0.280, 0.978, 60.455, 0.128 | 0.400, 0.996, 25.641, 0.046

5.6.3 Summary of Curve-Fitting Results
Overall, in terms of curve fitting, considering the OSs, the models using non-clustered data performed better in 24 out of 27 valid cases (the valid cases are those with at least one statistically sound MCF). Considering the web browsers, the non-clustering approach again led to more accurate results in all the valid cases (29 cases). Considering the OSs and the web browsers together (8 datasets), out of the 56 valid cases we analyzed (8 datasets times eight models per dataset, minus the invalid models), in terms of estimation, the approach without clustering led to more accurate results in 53 (94.6%) cases. Comparing the with-clustering results among themselves, in terms of curve fitting, out of eight datasets, the Gamma-based VDM was best in five (62.5%) cases. The Weibull, AML, and Normal-based VDMs were each equally best in two (25%) cases. The Power-law model was most accurate in one (12.5%) case. Therefore, based upon our findings from the estimation results, comparing modeling strategies, the results from the non-clustered approach were more accurate than those from the clustered approach.
However, when comparing the clustering results among themselves, the Gamma-based VDM was the most accurate model across all the datasets we had.

5.7 Prediction Results
In this section, we provide the results regarding the prediction capabilities of the models (comparing the predictions obtained with clustered and non-clustered data, and comparing the results generated from the clustering approach). For each case, the model that has the smallest value of AE and a p-value ≥ 0.05 was selected as having the best prediction capability and is highlighted in green. Regarding p-values, we used * to mark the models with p < 0.05.

5.7.1 Operating systems
Tables 15-18 present the values of AE, AB, R², and HH for the four operating systems in our study. Eight models per software were analyzed. There are five cases where neither of the MCFs was statistically sound (five invalid cases). For Windows, in all the valid VDMs but RQ, the MCFs with clustering led to more accurate results than the MCFs without clustering. For Mac, in four out of five valid cases, the MCF with clustering led to more accurate results. For IOS, the MCFs without clustering led to more accurate results in all the valid cases. For Linux, for all the models except the Gamma-based VDM, the MCFs with clustering produced the most accurate prediction results. In addition, comparing the with-clustering results, for Windows and Mac, the AML VDM and the YF VDM, respectively, have the smallest values of AB, AE, and HH. For IOS, the Weibull-based VDM and the Power-law model have the smallest AE values; however, the HH values show that the Power-law model should be selected as the first best model. For Linux, the Weibull-based VDM has the best results. Note that negative values of AB indicate that the model may underestimate the total number of discovered vulnerabilities.
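For reference, the sketch below follows the common definitions of AE and AB used in this literature (the formal definitions appear in Chapter 4 and are not repeated here): the mean magnitude and the mean signed value of the normalized prediction error, respectively. The numbers are toy values for illustration only.

```python
import numpy as np

def prediction_errors(predicted, actual):
    """AE: mean magnitude of the normalized error; AB: its mean
    signed value (negative AB indicates underestimation)."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    rel_err = (predicted - actual) / actual
    return np.abs(rel_err).mean(), rel_err.mean()

# Toy cumulative counts over three test intervals.
ae, ab = prediction_errors([950, 1010, 1060], [1000, 1050, 1100])
```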
Table 15: PREDICTION ACCURACY FOR WINDOWS
(each row: AE, AB, R-sq, HH with clustering | without clustering; * marks p < 0.05)
Gamma: 0.037, -0.037, 0.96, 0.060 | 0.063, -0.061, 0.94, 0.095
Weibull: 0.035, -0.045, 0.97, 0.049 | 0.091, -0.091, 0.94, 0.131
AML: 0.015, -0.015, 0.99, 0.018 | 0.138, -0.138, 0.88, 0.187
Normal: 0.021, -0.021, 0.99, 0.035 | 0.138, -0.138, 0.88, 0.187
Power-law: 0.035, -0.031, 0.84, 0.046 | 0.037, 0.025, 0.90, 0.040
RE: 0.021*, -0.019, 0.86, 0.035 | 0.106*, 0.106, 0.86, 0.100
RQ: 0.033*, -0.040, 0.98, 0.040 | 0.039, 0.030, 0.95, 0.042
YF: 0.036, -0.032, 0.96, 0.050 | 0.114, -0.114, 0.94, 0.158

Table 16: PREDICTION ACCURACY FOR MAC
(each row: AE, AB, R-sq, HH with clustering | without clustering; * marks p < 0.05)
Gamma: 0.080, -0.041, 0.97, 0.175 | 0.218, -0.218, 0.99, 0.263
Weibull: 0.061, -0.027, 0.98, 0.265 | 0.233, -0.233, 0.99, 0.287
AML: 0.032*, 0.0194, 0.99, 0.234 | 0.278*, -0.278, 0.97, 0.351
Normal: 0.153*, -0.130, 0.98, 0.309 | 0.278*, -0.278, 0.97, 0.351
Power-law: 1.059, 1.059, 0.58, 0.786 | 0.074, -0.074, 0.94, 0.080
RE: 0.119, 0.890, 0.86, 0.279 | 0.024*, 0.017, 0.97, 0.037
RQ: 0.270*, -0.202, 0.89, 0.361 | 0.082*, -0.082, 0.86, 0.090
YF: 0.035, -0.035, 0.98, 0.236 | 0.256, -0.256, 0.99, 0.320

Table 17: PREDICTION ACCURACY FOR IOS
(each row: AE, AB, R-sq, HH with clustering | without clustering; * marks p < 0.05)
Gamma: 0.260, -0.181, 0.95, 0.367 | 0.018, 0.000, 0.97, 0.025
Weibull: 0.259, -0.179, 0.95, 0.379 | 0.019, 0.005, 0.98, 0.027
AML: 0.502, -0.501, 0.87, 0.894 | 0.076, 0.076, 0.98, 0.088
Normal: 0.598, -0.598, 0.74, 1.142 | 0.076, 0.076, 0.95, 0.088
Power-law: 0.259, -0.179, 0.90, 0.365 | 0.019, 0.006, 0.87, 0.028
RE: 0.492*, -0.492, 0.87, 0.853 | 0.131, 0.131, 0.86, 0.148
RQ: 0.492*, -0.318, 0.84, 1.142 | 0.154*, -0.154, 0.86, 0.172
YF: 0.263, -0.155, 0.95, 0.373 | 0.092, 0.092, 0.99, 0.108

Table 18: PREDICTION ACCURACY FOR LINUX
(each row: AE, AB, R-sq, HH with clustering | without clustering; * marks p < 0.05)
Gamma: 0.274, -0.274, 0.95, 0.398 | 0.268, -0.268, 0.94, 0.354
Weibull: 0.032, 0.013, 0.99, 0.032 | 0.267, -0.267, 0.74, 0.352
AML: 0.142, 0.091, 0.98, 0.132 | 0.272, -0.272, 0.78, 0.366
Normal: 0.091, 0.063, 0.99, 0.082 | 0.272, -0.272, 0.78, 0.366
Power-law: 0.037, 0.037, 0.83, 0.037 | 0.267, -0.267, 0.70, 0.352
RE: 0.174, -0.174, 0.85, 0.398 | 0.190*, -0.190, 0.95, 0.239
RQ: 0.274, -0.274, 0.90, 0.398 | 0.278, -0.278, 0.74, 0.369
YF: 0.083, 0.083, 0.98, 0.074 | 0.240*, -0.240, 0.85, 0.313

5.7.2 Web browsers
Tables 19-22 present the values of AE, AB, R², and HH for the four web browsers in our study. Again, eight models per software were analyzed. There are two cases where neither of the MCFs was statistically sound (two invalid cases). For IE, for all the VDMs except the Power-law model, the MCFs without clustering led to more accurate results than the MCFs with clustering. For Safari, for all the VDMs except the Power-law model and the RE VDM, the MCFs without clustering led to the most accurate results. For Firefox, for all the VDMs except the RQ VDM, the MCFs with clustering led to the most accurate results. For Chrome, all the MCFs with clustering led to more accurate results than the MCFs without clustering. In addition, comparing the with-clustering results, for IE, Safari, and Firefox, the Power-law model had the smallest error values; for Chrome, the Gamma-based VDM had the smallest error values.
Table 19: PREDICTION ACCURACY FOR IE
(each row: AE, AB, R-sq, HH with clustering | without clustering; * marks p < 0.05)
Gamma: 0.240, -0.230, 0.99, 0.302 | 0.234, -0.234, 0.97, 0.273
Weibull: 0.280, -0.280, 0.99, 0.329 | 0.233, -0.233, 0.98, 0.272
AML: 0.294*, -0.294, 0.99, 0.347 | 0.157, -0.157, 0.99, 0.175
Normal: 0.295*, -0.295, 0.99, 0.347 | 0.157, -0.157, 0.99, 0.175
Power-law: 0.108, 0.074, 0.89, 0.184 | 0.233, -0.233, 0.89, 0.271
RE: 0.276, -0.276, 0.97, 0.326 | 0.149, -0.149, 0.92, 0.164
RQ: 0.308*, -0.308, 0.97, 0.372 | 0.232, -0.232, 0.88, 0.270
YF: 0.240, -0.230, 0.99, 0.302 | 0.141, -0.141, 0.97, 0.155

Table 20: PREDICTION ACCURACY FOR SAFARI
(each row: AE, AB, R-sq, HH with clustering | without clustering; * marks p < 0.05)
Gamma: 0.609, -0.609, 0.79, 1.126 | 0.156, -0.156, 0.94, 0.201
Weibull: 0.611, -0.611, 0.79, 1.134 | 0.187, -0.187, 0.94, 0.245
AML: 0.656, -0.656, 0.67, 1.278 | 0.231, -0.231, 0.88, 0.304
Normal: 0.656, -0.656, 0.68, 1.280 | 0.231, -0.231, 0.88, 0.304
Power-law: 0.023, -0.018, 0.90, 0.026 | 0.030, 0.026, 0.90, 0.037
RE: 0.640, -0.640, 0.70, 1.248 | 0.133*, 0.133, 0.88, 0.141
RQ: 0.640, -0.566, 0.85, 1.251 | 0.041, 0.040, 0.99, 0.047
YF: 0.616*, -0.612, 0.83, 1.140 | 0.211, -0.211, 0.86, 0.278

Table 21: PREDICTION ACCURACY FOR FIREFOX
(each row: AE, AB, R-sq, HH with clustering | without clustering; * marks p < 0.05)
Gamma: 0.017, -0.017, 0.99, 0.026 | 0.051, 0.035, 0.99, 0.066
Weibull: 0.020, -0.020, 0.99, 0.029 | 0.049, 0.031, 0.99, 0.064
AML: 0.024, -0.024, 0.98, 0.023 | 0.081, -0.081, 0.99, 0.112
Normal: 0.034, -0.034, 0.95, 0.032 | 0.081, -0.081, 0.98, 0.112
Power-law: 0.015, 0.014, 0.99, 0.015 | 0.069, 0.067, 0.98, 0.089
RE: 0.034, -0.034, 0.96, 0.030 | 0.161, 0.161, 0.89, 0.174
RQ: 0.075*, -0.054, 0.86, 0.032 | 0.096, 0.096, 0.99, 0.113
YF: 0.021, -0.021, 0.99, 0.022 | 0.051, -0.032, 0.99, 0.069

Table 22: PREDICTION ACCURACY FOR CHROME
(each row: AE, AB, R-sq, HH with clustering | without clustering; * marks p < 0.05)
Gamma: 0.173, -0.173, 0.99, 0.167 | 0.281, -0.281, 0.99, 0.367
Weibull: 0.225, -0.225, 0.98, 0.215 | 0.317, -0.317, 0.99, 0.422
AML: 0.232, -0.232, 0.99, 0.222 | 0.307, -0.307, 0.99, 0.405
Normal: 0.232, -0.232, 0.99, 0.222 | 0.307, -0.307, 0.99, 0.405
Power-law: 0.324, 0.324, 0.94, 0.345 | 0.167, 0.167, 0.94, 0.191
RE: 0.298*, -0.298, 0.86, 0.305 | 0.364*, 0.364, 0.89, 0.383
RQ: 0.334*, 0.324, 0.94, 0.345 | 0.077*, 0.077, 0.99, 0.102
YF: 0.225, -0.225, 0.98, 0.213 | 0.304, -0.304, 0.95, 0.402

5.7.3 Summary of Prediction Results
In terms of prediction accuracy, considering the OSs, out of 27 valid cases (4 OSs and 8 models, excluding invalid cases), the MCFs that used clustered data led to more accurate results in 17 (63%) cases. Considering the web browsers, the MCFs with clustering had the most accurate results in 16 (53.3%) of the 30 valid cases we analyzed. Overall, considering the OSs and web browsers together, out of the 57 valid cases analyzed (eight datasets times eight models per dataset, minus seven invalid cases), the MCFs with clustering led to more accurate results in 33 (58%) cases, while the MCFs without clustering produced more accurate results in 24 (42%) cases. Comparing the with-clustering results among themselves, in terms of prediction, out of eight datasets, the Power-law model and the Weibull-based VDM were the most accurate in four (50%) and two (25%) cases, respectively. The Gamma-based, AML, and YF VDMs were each equally best in one (12.5%) case. The other VDMs were not best in any of the cases.
Therefore, based upon our findings from the prediction results, comparing modeling strategies, the results with clustering were more accurate for the eight datasets we analyzed. Comparing the with-clustering results among themselves, in terms of prediction accuracy, the Power-law model was the most accurate model. Please remember that all the conclusions we draw from this research, and their validity, are limited by our database uncertainties.

5.8 Discussion
In this chapter, we explored the applicability of clustering to vulnerability data and investigated whether this approach can lead to more accurate predictions with common SRMs/VDMs. Our results show that, for our datasets, the cluster-based approach provided better prediction results than the approach without clustering. We have summarized our findings in the guidelines presented in Table 23.

Table 23: MODELING GUIDELINE
Without clustering: fitting: Gamma; prediction: Power-law
With clustering: fitting: Gamma; prediction: Power-law

5.9 Limitations
The main limitations of this chapter are the following:
- One limitation concerns the uncertainty of the databases we used. Vulnerability databases usually have some uncertainty regarding variables associated with the reported vulnerabilities, such as published dates.
- We have only applied the approach to 8 software. We do not yet know how well this works for other vulnerability datasets associated with a software, though we plan to extend this work in the future. Since we integrated all the vulnerabilities associated with multiple versions of a software, which might have introduced some sources of dependency, we plan to rebuild the datasets we used: for a given software, we will integrate only the versions whose source codes have reached a threshold of similarity, based on a similarity metric we will define.
- We have applied the approach to all the vulnerabilities of a software, rather than subdividing them by version. This was mainly because the sample size of vulnerabilities gets much smaller when considering individual versions. Other researchers have looked at individual versions separately [24]. However, we believe that different versions of a software cannot be assumed completely independent, since different versions have large overlaps in their code base.

5.10 Summary
In this chapter, we used a common clustering technique to group the vulnerabilities into distinct clusters, using the textual information reported for these vulnerabilities. We applied our approach to the vulnerability datasets introduced in Chapter 3. We also investigated whether this approach could produce more accurate results, in terms of estimation/prediction, compared to the case where all the vulnerabilities were estimated together (results from Chapter 3). In the next chapter, we will investigate different vulnerability grouping approaches.

Chapter 6: A Comparison of Vulnerabilities' Grouping Strategies

6.1 Introduction
In this chapter, we present some guidelines to model vulnerability discovery data based on two commonly employed vulnerability grouping strategies. In the first strategy, for each software, we analyze all vulnerabilities reported for any of its versions. In the second strategy, for each software, we group only consecutive versions that share most of their reported vulnerabilities.
We used the eight models introduced in Chapter 3 (eight common VDMs) for the discovery process of vulnerabilities in the eight well-known software (four operating systems and four web browsers) also introduced in Chapter 3. The accuracy of these models was investigated based on their fitting and prediction capabilities.

6.2 Motivation
Many studies choose all vulnerabilities when assessing product families. This may be difficult to justify for products with long lifespans: some of the older vulnerabilities have long been fixed, and the code base of these products is likely to have evolved significantly. Other studies [24], [32] assess specific versions of products and only consider vulnerabilities that have been reported for those specific versions. This may be too restrictive, as a subsequent version of the same product family is likely to share a large proportion of the code base with its predecessor; hence, it is also likely to share a large proportion of the vulnerabilities. Our research question is whether we can improve the results by filtering a given product's vulnerability dataset down to only those versions that share most of their reported vulnerabilities with a subsequent version of the product. In other words, the question is: "should we only consider the vulnerabilities associated with a single version of a software for our modeling, or all vulnerabilities reported for any of its versions?". To answer this question, we need to consider each scenario separately and compare the results. In this chapter, we make the following contributions:
- We use two strategies for grouping vulnerabilities (vulnerabilities merged for all versions, and groups built based on the number of common vulnerabilities across versions).
- We apply eight common VDMs to the mentioned groups and compare their curve-fitting and prediction capabilities to derive a guideline.

6.3 Grouping Strategy
As mentioned in Chapter 3, we analyze the reported vulnerabilities associated with four well-known OSs: Windows (1995-2017), Mac (1997-2017), IOS (the OS associated with Cisco) (1992-2017), and Linux (1994-2017), as well as four well-known web browsers: Internet Explorer (1997-2017), Safari (2003-2017), Firefox (2003-2017), and Chrome (2008-2017). These software have been selected because they are the most widely used and have the most vulnerabilities in the databases.

For each software, we considered two grouping strategies. In the first strategy (St. 1), we analyze all vulnerabilities reported for any version of the software; for instance, all the vulnerabilities reported for mac_os, mac_os_server, mac_os_x, and mac_os_x_server were put together to create a vulnerability database for Mac. In the second strategy (St. 2), we group versions based on the percentage of common vulnerabilities (i.e., we group the consecutive versions with more than 70% common reported vulnerabilities). We assume the percentage of common vulnerabilities between two or more consecutive versions of a software to be a good measure of the similarity of their source code, and we selected 70% as the threshold above which versions are similar enough to be analyzed together. Table 24 shows the percentage of common vulnerabilities within some versions of Firefox. The threshold is met for versions 0.x, 1.x, and 2.x, so their associated vulnerabilities can be grouped together. Even though the threshold is also met for versions 0.x and 3.x, we do not consider them a separate group since they are not consecutive versions. For the software versions that do not satisfy this condition, we only consider the product with the most vulnerabilities reported (e.g., Linux_Kernel).
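A minimal sketch of this pairwise-overlap test follows. The exact normalization of the percentage is not spelled out here, so the sketch assumes the overlap is measured relative to the earlier version group's vulnerability set; the CVE identifiers are hypothetical.

```python
def common_vuln_pct(earlier, later):
    # Assumption: overlap measured relative to the earlier group's set.
    a, b = set(earlier), set(later)
    return len(a & b) / len(a)

# Hypothetical CVE identifiers for two consecutive version groups.
v_old = {"CVE-2005-0001", "CVE-2005-0002", "CVE-2006-0003"}
v_new = {"CVE-2005-0001", "CVE-2005-0002", "CVE-2007-0004"}
group_together = common_vuln_pct(v_old, v_new) > 0.70  # False here (2/3)
```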
Table 24: PERCENTAGE OF COMMON VULNERABILITIES WITHIN FIREFOX VERSIONS
(rows and columns: Firefox 0.x, 1.x, 2.x, 3.x; * marks cells below the diagonal)
Firefox 0.x: 1, 0.94, 0.79, 0.71
Firefox 1.x: *, 1, 0.70, 0.52
Firefox 2.x: *, *, 1, 0.63
Firefox 3.x: *, *, *, 1

Table 25 presents the total number of vulnerabilities for each software (all versions together (St. 1) and versions grouped based on the similarity threshold (St. 2)), as well as their skewness values. All the datasets associated with the eight software we analyzed are right-skewed (each has a positive skewness value greater than 0.5). Safari was not considered in St. 2 since, after grouping the versions with more than 70% common vulnerabilities, we did not have enough vulnerabilities for modeling.

Table 25: NUMBER OF VULNERABILITIES PER SOFTWARE
(each entry: # vulnerabilities (skewness))
OS (St. 1): Windows 3100 (2.10); Mac 2705 (2.19); IOS 650 (3.12); Linux 4745 (3.50)
Web browser (St. 1): IE 1775 (2.65); Safari 943 (2.98); Firefox 1477 (10.00); Chrome 1837 (1.63)
OS (St. 2): Windows (Vista & 7) 1054 (1.52); Mac_OS_X 1907 (0.63); IOS (11.x & 12.x) 317 (0.51); Linux_Kernel 1993 (2.79)
Web browser (St. 2): IE (5.x, 6.x) 667 (1.07); IE (9.x, 10.x) 720 (2.34); IE (10.x, 11.x) 677 (2.50); Firefox (0.x, 1.x, 2.x) 882 (8.05); Chrome (0.x, 1.x, 2.x) 878 (0.86); Chrome (3.x, 4.x) 959 (1.35)

6.4 Analysis
As described in Chapter 4, for the vulnerability data in each group, we divided the time axis into 30-day intervals (t = 0 is associated with the vulnerability with the earliest published date) and counted the cumulative frequency of vulnerabilities detected in each interval. The regression and analysis methods we used are described in Section 4.3.

6.4.1 Curve-fitting error indicators
We used the eight models for the discovery process of vulnerabilities introduced in Chapter 3 in eight well-known software (four operating systems and four web browsers). The models were fitted to the 18 datasets (8 datasets from St. 1 and 10 datasets from St. 2) using a non-linear regression method described in [24]. Most of the error indicators considered in this chapter are the same as in previous chapters (χ², HH). However, we added a few more indicators for better judgment between the models. The Akaike Information Criterion (AIC) is frequently used to make a fair comparison between models; it also penalizes models that overfit, a capability that metrics like χ² lack. AIC is formally defined as:

AIC = -2 \log(\text{likelihood}) + 2M \qquad (22)

where M is the number of free parameters of the examined model. Alhazmi & Malaiya reported AIC values in [18], [21]. To compare two models, we consider their difference:

\Delta_i = AIC_i - AIC_{min} \qquad (23)

where AIC_i is the AIC of the i-th model, and AIC_min is the lowest AIC obtained among the set of models examined (i.e., the preferred model). The rule of thumb, outlined in [66], is the following: if Δi < 2, there is substantial support for the i-th model (or the evidence against it is worth only a bare mention), and the proposition that it is a proper description is highly probable; if 2 < Δi < 4, there is strong support for the i-th model; if 4 < Δi < 7, there is considerably less support; and models with Δi > 10 have essentially no support. In this chapter, models with Δi < 4 were selected as the best models.
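The model-selection rule based on Equations 22-23 and the Δi < 4 cutoff can be sketched as follows; the log-likelihood values are hypothetical, and each model is assumed to have M = 2 free parameters.

```python
def aic(log_likelihood, n_params):
    # Equation 22: AIC = -2 log L + 2M.
    return -2.0 * log_likelihood + 2.0 * n_params

def deltas(aic_values):
    # Equation 23: Delta_i = AIC_i - AIC_min.
    best = min(aic_values.values())
    return {name: value - best for name, value in aic_values.items()}

# Hypothetical log-likelihoods; M = 2 free parameters per model.
log_liks = {"Gamma": -1455.2, "Weibull": -1456.1, "Power-law": -1530.8}
aic_values = {name: aic(ll, 2) for name, ll in log_liks.items()}
selected = [name for name, d in deltas(aic_values).items() if d < 4]
```

In this toy example, the Weibull model (Δi = 1.8) is retained alongside the best model, while the Power-law model (Δi = 151.2) is rejected.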
6.4.2 Prediction error indicators
In addition to AE and AB, we also report %ΔAE_i, which represents the percentage difference between the AE of the i-th model and that of the model with the minimum AE:

\%\Delta AE_i = (AE_i - AE_{min}) \times 100 \qquad (24)

where AE_i is the AE of the i-th model, and AE_min is the lowest AE obtained among the set of models examined (i.e., the best model).

6.5 Curve-Fitting Results
6.5.1 Operating systems
Tables 26-27 contain the χ² goodness-of-fit test p-values, HH, the AIC values, and Δi for the OSs (St. 1) and the OSs (St. 2), respectively. In each case, the model that has the smallest values of HH and AIC is selected as the best-fitting model and highlighted in green; the other models with Δi < 4 are also selected as best models.

Table 26: CURVE FITTING ACCURACY FOR OSS (ST.1)
(each row: p-val, HH, AIC, Δi)
Windows | Mac:
Gamma: 1.00, 0.03, 2914.58, 0.00 | 1.00, 0.07, 6080.39, 5.78
Weibull: 1.00, 0.03, 2916.34, 1.75 | 1.00, 0.07, 6079.86, 5.26
AML: 1.00, 0.05, 3083.02, 168.43 | 0.99, 0.08, 6252.64, 178.04
Normal: 1.00, 0.05, 3082.02, 167.43 | 0.99, 0.08, 6251.64, 177.04
Power-law: 1.00, 0.05, 3075.58, 161.00 | 1.00, 0.07, 6077.50, 2.90
RE: 1.00, 0.08, 3381.79, 467.21 | 0.99, 0.08, 6165.09, 90.49
RQ: 1.00, 0.05, 3078.25, 163.66 | 1.00, 0.07, 6074.60, 0.00
YF: 1.00, 0.04, 2967.94, 53.36 | 0.98, 0.08, 6166.42, 91.81
IOS | Linux:
Gamma: 0.25, 0.03, 2220.06, 102.65 | 0.46, 0.119, 20506.23, 1036.55
Weibull: 0.43, 0.03, 2216.41, 99.00 | 0.78, 0.118, 20493.64, 1023.95
AML: 1.00, 0.03, 2118.41, 1.00 | 0.47, 0.108, 20240.71, 771.02
Normal: 1.00, 0.03, 2117.41, 0.00 | 0.98, 0.083, 19469.69, 0.00
Power-law: 0.48, 0.03, 2213.77, 96.36 | 0.99, 0.118, 20490.83, 1021.14
RE: 0.00, 0.04, 2288.65, 171.24 | 0.99, 0.088, 19652.27, 182.59
RQ: 0.00, 0.07, 2633.48, 516.06 | 0.00, 0.134, 20856.60, 1386.91
YF: 1.00, 0.03, 2152.12, 34.71 | 0.99, 0.088, 19652.72, 183.04

Table 27: CURVE FITTING ACCURACY FOR OSS (ST.2)
(each row: p-val, HH, AIC, Δi)
Windows (Vista & 7) | Mac_OS_X:
Gamma: 0.99, 0.07, 1830.80, 39.84 | 0.23, 0.12, 3236.55, 77.19
Weibull: 0.99, 0.07, 1829.74, 38.78 | 0.45, 0.12, 3235.16, 75.81
AML: 0.00, 0.10, 1932.50, 141.53 | 0.90, 0.11, 3188.84, 29.49
Normal: 0.00, 0.10, 1931.50, 140.53 | 0.98, 0.11, 3178.92, 19.57
Power-law: 0.98, 0.07, 1827.65, 36.69 | 0.16, 0.12, 3233.08, 73.72
RE: 0.99, 0.07, 1790.96, 0.00 | 1.00, 0.10, 3159.35, 0.00
RQ: 0.98, 0.07, 1810.02, 19.06 | 0.99, 0.12, 3238.09, 78.74
YF: 1.00, 0.07, 1794.40, 3.44 | 0.99, 0.10, 3164.66, 5.31
IOS (11.x & 12.x) | Linux_Kernel:
Gamma: 0.27, 0.05, 852.90, 99.33 | 0.99, 0.05, 7325.63, 73.01
Weibull: 0.72, 0.04, 827.59, 74.01 | 0.99, 0.05, 7320.49, 67.88
AML: 1.00, 0.03, 754.58, 1.00 | 0.98, 0.06, 7389.79, 137.17
Normal: 1.00, 0.03, 753.58, 0.00 | 0.98, 0.06, 7388.79, 136.17
Power-law: 1.00, 0.06, 923.13, 169.55 | 0.76, 0.05, 7317.41, 64.79
RE: 0.00, 0.09, 997.64, 244.06 | 1.00, 0.05, 7252.62, 0.00
RQ: 0.14, 0.07, 953.12, 199.54 | 0.00, 0.06, 7568.41, 315.79
YF: 1.00, 0.03, 762.78, 9.21 | 1.00, 0.05, 7275.13, 22.52

For the OSs (St. 1 and St. 2), in all cases, most of the models are statistically sound, with p-values greater than 0.05. The RE and RQ VDMs each produced unsound p-values in two cases.
However, we cannot accept all the models with sound p-values; we must also consider their associated Δi values. Comparing the grouping strategies in terms of curve fitting, the AML and Normal-based VDMs were the best models for IOS under both strategies. However, for the other OSs, the results associated with St. 1 and St. 2 differ.

6.5.2 Web browsers
Tables 28-29 contain the χ² goodness-of-fit test p-values, HH, the AIC values, and Δi for the web browsers (St. 1) and the web browsers (St. 2), respectively. For the web browsers (St. 1), most of the models still have statistically sound p-values; however, this is not true for the web browsers (St. 2). Comparing the grouping strategies in terms of curve fitting, there is no common best model between Chrome (St. 1) and Chrome (0.x, 1.x, 2.x); however, for Chrome and its other subversion group (Chrome 3.x & 4.x), the Gamma-based VDM was the best common model under both strategies.

Table 28: CURVE FITTING ACCURACY FOR WEB BROWSERS (ST.1)
(each row: p-val, HH, AIC, Δi)
IE | Safari:
Gamma: 1.00, 0.12, 4906.01, 293.35 | 1.00, 0.06, 1598.98, 0.00
Weibull: 1.00, 0.12, 4903.48, 290.82 | 1.00, 0.07, 1610.80, 11.82
AML: 1.00, 0.09, 4666.23, 53.57 | 0.99, 0.09, 1716.48, 117.49
Normal: 1.00, 0.08, 4612.66, 0.00 | 0.99, 0.09, 1715.48, 116.49
Power-law: 0.99, 0.12, 4901.34, 288.68 | 1.00, 0.07, 1629.14, 30.16
RE: 1.00, 0.09, 4639.14, 26.48 | 0.99, 0.10, 1750.82, 151.84
RQ: 1.00, 0.13, 4952.64, 339.98 | 1.00, 0.07, 1632.85, 33.87
YF: 0.99, 0.09, 4619.17, 6.51 | 1.00, 0.08, 1667.18, 68.20
Firefox | Chrome:
Gamma: 1.00, 0.03, 2332.83, 3.07 | 0.47, 0.07, 2773.10, 0.00
Weibull: 1.00, 0.03, 2332.35, 2.60 | 0.43, 0.08, 2822.16, 49.05
AML: 0.00, 0.05, 2606.51, 276.76 | 0.23, 0.09, 2892.74, 119.64
Normal: 0.00, 0.05, 2605.51, 275.76 | 0.20, 0.09, 2891.74, 118.64
Power-law: 1.00, 0.03, 2329.75, 0.00 | 0.24, 0.10, 2905.96, 132.86
RE: 0.05, 0.04, 2496.37, 166.61 | 0.00, 0.12, 3025.95, 252.85
RQ: 0.46, 0.03, 2370.79, 41.04 | 0.34, 0.09, 2848.45, 75.35
YF: 1.00, 0.03, 2334.13, 4.38 | 0.35, 0.08, 2843.79, 70.69

Table 29: CURVE FITTING ACCURACY FOR WEB BROWSERS (ST.2)
(each row: p-val, HH, AIC, Δi)
IE (5.x & 6.x) | IE (9.x & 10.x):
Gamma: 0.42, 0.04, 2403.89, 0.00 | 0.00, 0.03, 714.25, 81.52
Weibull: 0.82, 0.04, 2405.72, 1.83 | 0.00, 0.03, 664.79, 32.06
AML: 0.00, 0.06, 2581.77, 177.88 | 0.28, 0.02, 633.73, 1.00
Normal: 0.00, 0.06, 2580.77, 176.88 | 0.26, 0.02, 632.73, 0.00
Power-law: 0.00, 0.06, 2648.43, 244.54 | 0.00, 0.11, 913.15, 280.42
RE: 0.00, 0.08, 2752.95, 349.06 | 0.00, 0.13, 946.00, 313.27
RQ: 0.00, 0.07, 2734.52, 330.63 | 0.00, 0.12, 933.75, 301.02
YF: 0.00, 0.05, 2491.38, 87.49 | 0.00, 0.03, 714.25, 81.52
IE (10.x & 11.x) | Firefox (0.x, 1.x, 2.x):
Gamma: 0.95, 0.02, 543.12, 0.00 | 0.00, 0.08, 1710.73, 7.76
Weibull: 0.65, 0.03, 551.15, 8.04 | 0.00, 0.08, 1710.73, 7.76
AML: 0.85, 0.04, 598.93, 55.81 | 0.00, 0.11, 1802.55, 99.58
Normal: 0.63, 0.04, 597.93, 54.81 | 0.00, 0.11, 1801.55, 98.58
Power-law: 0.00, 0.12, 769.67, 226.56 | 0.00, 0.08, 1707.73, 4.76
RE: 0.00, 0.14, 796.28, 253.16 | 0.00, 0.08, 1707.57, 4.60
RQ: 0.00, 0.13, 786.27, 243.16 | 0.05, 0.08, 1706.58, 3.60
YF: 0.76, 0.03, 574.28, 31.17 | 0.11, 0.08, 1702.97, 0.00
Chrome (0.x, 1.x, 2.x) | Chrome (3.x, 4.x):
Gamma: 0.00, 0.04, 1026.71, 146.40 | 0.07, 0.09, 1273.70, 0.00
Weibull: 0.32, 0.02, 908.83, 28.52 | 0.05, 0.09, 1275.41, 1.71
AML: 0.68, 0.03, 936.37, 56.07 | 0.00, 0.13, 1357.99, 84.29
Normal: 0.68, 0.03, 935.37, 55.07 | 0.00, 0.13, 1356.99, 83.29
Power-law: 0.00, 0.22, 1444.48, 564.17 | 0.00, 0.10, 1292.40, 18.70
RE: 0.00, 0.24, 1464.31, 584.00 | 0.00, 0.11, 1322.00, 48.30
RQ: 0.00, 0.24, 1462.09, 581.78 | 0.04, 0.10, 1279.97, 6.27
YF: 1.00, 0.02, 880.31, 0.00 | 0.04, 0.10, 1282.88, 9.18
6.5.3 Summary of curve-fitting results
Overall, in terms of curve fitting, considering the OSs (St. 1), the Normal-based VDM was selected in two cases out of four, while the AML, Weibull, and Gamma-based VDMs and the Power-law and RQ models were each equally best in one case out of four. Considering the OSs (St. 2), the RE model was the best model in three cases out of four, and the AML and Normal-based VDMs and the YF model were each selected as the best model in one case out of four. Apart from these overlaps, the best models differ between St. 1 and St. 2.

Considering the web browsers (St. 1), the Gamma-based VDM provided the best fit in three out of four cases, whereas the Weibull and Normal-based VDMs, along with the Power-law model, were each the best model in one of the cases. However, considering the web browsers (St. 2), the Gamma-based VDM was selected as the best model in three out of six cases, and the Weibull-based VDM was found to be the best model in two cases out of six.

6.6 Prediction Results
6.6.1 Operating systems
Tables 30-31 present the values of AE, AB, HH, and %ΔAE_i for the OSs (St. 1) and the OSs (St. 2), respectively. AB can be positive (for overestimation) or negative (for underestimation), while AE is always positive. We used * to mark the models with p < 0.05 and did not calculate %ΔAE_i for those models ("NA" is used as the %ΔAE_i value of the models with p < 0.05). In each case, the model with the smallest value of AE and p > 0.05 was selected as having the best prediction capability and is highlighted in green. In addition, the model(s) with %ΔAE_i < 2 were also selected as best prediction models, since they show prediction capability similar to the best model (the model(s) with %ΔAE_i = 0).

Comparing the grouping strategies in terms of prediction capabilities, only for Mac (St. 1) and Mac_OS_X (St. 2), and for Linux (St. 1) and Linux_Kernel (St. 2), was there one common best model: the Gamma-based VDM and the Power-law model were the common best models for Mac and Linux, respectively. For the other OSs, the results associated with St. 1 and St. 2 differ.
Table 30: PREDICTION ACCURACY FOR OSS (ST.1)
(each row: AE, AB, HH, %ΔAEi; * marks p < 0.05)
Windows | Mac:
Gamma: 0.13, 0.12, 0.167, 7.59 | 0.15, -0.14, 0.280, 0.00
Weibull: 0.13, 0.12, 0.169, 7.72 | 0.25, -0.25, 0.435, 10.40
AML: 0.05, -0.05, 0.103, 0.00 | 0.27, -0.27, 0.459, 12.57
Normal: 0.05, -0.05, 0.103, 0.00 | 0.27, -0.27, 0.459, 12.57
Power-law: 0.13, 0.12, 0.171, 7.85 | 0.36, 0.36, 0.322, 21.40
RE: 0.45, 0.45, 0.531, 39.98 | 0.87, 0.87, 0.737, 72.64
RQ: 0.07, 0.05, 0.094, 2.07 | 0.20, 0.20, 0.181, 5.29
YF: 0.08, 0.08, 0.100, 3.01 | 0.24, -0.24, 0.420, 9.64
IOS | Linux:
Gamma: 0.17, -0.17, 0.228, 4.13 | 0.31, -0.31, 0.594, 18.33
Weibull: 0.17, -0.17, 0.225, 3.91 | 0.38, -0.38, 0.733, 25.17
AML: 0.25, -0.25, 0.390, 12.69 | 0.41, -0.41, 0.785, 28.00
Normal: 0.25, -0.25, 0.390, 12.69 | 0.41, -0.41, 0.785, 28.00
Power-law: 0.16, -0.16, 0.224, 3.87 | 0.14, -0.08, 0.245, 1.77
RE: 0.20*, 0.20, 0.226, NA | 0.13, 0.09, 0.107, 0.00
RQ: 0.27*, -0.27, 0.388, NA | 0.14*, -0.04, 0.206, NA
YF: 0.13, -0.12, 0.179, 0.00 | 0.40, -0.40, 0.771, 27.10

Table 31: PREDICTION ACCURACY FOR OSS (ST.2)
(each row: AE, AB, HH, %ΔAEi; * marks p < 0.05)
Windows (Vista & 7) | Mac_OS_X:
Gamma: 0.14, -0.12, 0.252, 5.83 | 0.24, -0.24, 0.443, 0.40
Weibull: 0.17, -0.16, 0.313, 8.86 | 0.31, -0.31, 0.549, 7.56
AML: 0.24, -0.24, 0.421, 15.91 | 0.32, -0.32, 0.559, 8.54
Normal: 0.24, -0.24, 0.421, 15.91 | 0.32, -0.32, 0.559, 8.54
Power-law: 0.08, 0.05, 0.080, 0.00 | 0.23, 0.23, 0.197, 0.00
RE: 0.15, 0.15, 0.131, 6.50 | 0.55, 0.55, 0.440, 31.42
RQ: 0.11, 0.10, 0.097, 2.50 | 0.23, 0.23, 0.198, 0.04
YF: 0.23, -0.23, 0.404, 14.37 | 0.31, -0.31, 0.552, 7.82
IOS (11.x & 12.x) | Linux_Kernel:
Gamma: 0.08, 0.04, 0.099, 0.41 | 0.31, -0.31, 0.552, 26.96
Weibull: 0.08, 0.04, 0.099, 0.45 | 0.38, -0.38, 0.705, 33.67
AML: 0.07, -0.07, 0.084, 0.00 | 0.40, -0.40, 0.730, 35.28
Normal: 0.07, -0.07, 0.084, 0.00 | 0.40, -0.40, 0.730, 35.28
Power-law: 0.08, 0.04, 0.100, 0.46 | 0.04, 0.02, 0.053, 0.00
RE: 0.29*, 0.29, 0.359, NA | 0.33, 0.33, 0.369, 28.72
RQ: 0.10, 0.07, 0.129, 2.38 | 0.06*, -0.06, 0.118, NA
YF: 0.13, 0.12, 0.172, 5.91 | 0.37, -0.37, 0.685, 32.90

6.6.2 Web browsers
Tables 32-33 present the values of AE, AB, HH, and %ΔAE_i for the four web browsers (St. 1 and St. 2).
Table 32: PREDICTION ACCURACY FOR WEB BROWSERS (ST.1)
(each row: AE, AB, HH, %ΔAEi; * marks p < 0.05)
IE | Safari:
Gamma: 0.20, -0.16, 0.364, 7.90 | 0.16, 0.16, 0.140, 5.42
Weibull: 0.22, -0.19, 0.423, 10.17 | 0.10, 0.01, 0.131, 0.00
AML: 0.33, -0.33, 0.673, 21.31 | 0.15, -0.14, 0.271, 4.94
Normal: 0.33, -0.33, 0.673, 21.31 | 0.15, -0.14, 0.271, 4.94
Power-law: 0.17, -0.09, 0.266, 4.92 | 0.43, 0.43, 0.388, 32.61
RE: 0.12, 0.06, 0.104, 0.00 | 1.02, 1.02, 0.870, 91.95
RQ: 0.15, -0.05, 0.221, 3.78 | 0.29, 0.29, 0.266, 18.74
YF: 0.29, -0.29, 0.610, 17.77 | 0.12, -0.05, 0.187, 1.34
Firefox | Chrome:
Gamma: 0.05, -0.05, 0.049, 2.21 | 0.21, -0.21, 0.360, 0.00
Weibull: 0.05, -0.04, 0.049, 2.17 | 0.31, -0.31, 0.508, 10.13
AML: 0.18*, -0.18, 0.236, NA | 0.27, -0.27, 0.446, 6.22
Normal: 0.18*, -0.18, 0.236, NA | 0.27, -0.27, 0.446, 6.22
Power-law: 0.05, -0.04, 0.048, 2.13 | 1.00, 1.00, 0.833, 79.27
RE: 0.06, 0.06, 0.086, 3.94 | 2.14*, 2.14, 1.537, NA
RQ: 0.02, 0.00, 0.031, 0.00 | 0.40, 0.40, 0.374, 19.41
YF: 0.07, -0.07, 0.079, 4.13 | 0.26, -0.26, 0.435, 5.12

Table 33: PREDICTION ACCURACY FOR WEB BROWSERS (ST.2)
(each row: AE, AB, HH, %ΔAEi; * marks p < 0.05)
IE (5.x & 6.x) | IE (9.x & 10.x):
Gamma: 0.06, 0.06, 0.065, 2.07 | 0.23*, 0.23, 0.268, NA
Weibull: 0.04, 0.03, 0.043, 0.00 | 0.17*, 0.17, 0.192, NA
AML: 0.10*, -0.10, 0.138, NA | 0.04, -0.04, 0.054, 1.80
Normal: 0.10*, -0.10, 0.138, NA | 0.04, -0.04, 0.054, 1.80
Power-law: 0.17*, 0.17, 0.169, NA | 0.33*, 0.33, 0.365, NA
RE: 0.27*, 0.27, 0.260, NA | 0.68*, 0.68, 0.680, NA
RQ: 0.23*, 0.23, 0.218, NA | 0.33*, 0.33, 0.363, NA
YF: 0.07*, -0.07, 0.110, NA | 0.03, 0.03, 0.029, 0.00
IE (10.x & 11.x) | Firefox (0.x, 1.x, 2.x):
Gamma: 0.06, 0.06, 0.067, 0.85 | 0.14*, -0.14, 0.157, NA
Weibull: 0.05, -0.05, 0.066, 0.00 | 0.14*, -0.14, 0.157, NA
AML: 0.12, -0.12, 0.149, 6.88 | 0.24*, -0.24, 0.280, NA
Normal: 0.12, -0.12, 0.149, 6.88 | 0.24*, -0.24, 0.280, NA
Power-law: 0.43*, 0.43, 0.442, NA | 0.14*, -0.13, 0.148, NA
RE: 0.86*, 0.86, 0.811, NA | 0.11*, -0.10, 0.119, NA
RQ: 0.41*, 0.41, 0.424, NA | 0.14, -0.14, 0.160, 2.21
YF: 0.09, -0.09, 0.114, 3.79 | 0.12, -0.11, 0.134, 0.00
Chrome (0.x, 1.x, 2.x) | Chrome (3.x, 4.x):
Gamma: 0.12*, 0.12, 0.175, NA | 0.25, -0.25, 0.305, 0.00
Weibull: 0.04, 0.04, 0.063, 2.10 | 0.27, -0.27, 0.329, 2.51
AML: 0.02, -0.02, 0.026, 0.17 | 0.27*, -0.27, 0.334, NA
Normal: 0.02, -0.02, 0.026, 0.17 | 0.27*, -0.27, 0.334, NA
Power-law: 0.52*, 0.52, 0.779, NA | 0.04*, -0.02, 0.052, NA
RE: 1.19*, 1.19, 1.778, NA | 0.04*, 0.04, 0.074, NA
RQ: 0.55*, 0.55, 0.816, NA | 0.12*, -0.12, 0.135, NA
YF: 0.02, 0.01, 0.028, 0.00 | 0.27*, -0.27, 0.338, NA

6.6.3 Summary of prediction results
Overall, in terms of prediction, considering the OSs (St. 1), all the models except the Weibull-based VDM were selected as the best model in one case out of four. Considering the OSs (St. 2), the Power-law model was the best one in all four cases; the Gamma-based VDM was the most accurate in two cases, and the other models except the RE model were each selected as the best model in one case. Considering the web browsers (St. 1), the Gamma and Weibull-based VDMs and the RE, YF, and RQ models were each better than the other models in one case out of four. However, considering the web browsers (St. 2), the YF model was the most accurate in three cases out of six, and the Gamma, Weibull, AML, and Normal-based VDMs were each the best model in two cases. Please remember that all the conclusions we draw from this research, and their validity, are limited by our database uncertainties.

6.7 Discussion
In this section, we take a closer look at our findings and then offer some guidelines to model vulnerability discovery data with respect to the shape of the discovery intensity functions. Tables 34-35 show the summary of the selected models per dataset for curve fitting and prediction, respectively.
For curve fitting, the models with p-value < 0.05 are highlighted in red. Based upon the results, in terms of curve fitting, out of the eight datasets grouped by Strategy 1 (St. 1), the Gamma-based, Normal-based, Weibull-based, Power-law, AML, and RQ VDMs provided the best (or equally best) fit in four, three, two, two, one, and one cases, respectively. The other models were not the most accurate in any case.

Table 34: SUMMARY OF SELECTED MODELS PER DATASET (CURVE-FITTING)
[Check-mark matrix of the eight models (Gamma, Weibull, AML, Normal, Power-law, RE, RQ, YF) against the St. 1 and St. 2 datasets for each OS and web browser.]

Table 35: SUMMARY OF SELECTED MODELS PER DATASET (PREDICTION)
[Check-mark matrix of the eight models (Gamma, Weibull, AML, Normal, Power-law, RE, RQ, YF) against the St. 1 and St. 2 datasets for each OS and web browser.]

In terms of curve fitting, out of the ten datasets in St. 2, the YF, Gamma-based, and RE VDMs were most accurate in four, three, and three cases, respectively. The AML, Normal-based, and RQ VDMs were each most accurate in one case. The other models were not the most accurate in any case.

In terms of prediction accuracy, out of the eight datasets in St. 1, the Gamma-based, RE, and YF VDMs each provided the best (or equally best) predictions in two cases; the other models were each most accurate in one case. In terms of prediction accuracy, out of the ten datasets in St. 2, the Gamma-based, Power-law, Weibull-based, AML, Normal-based, and YF VDMs were most accurate in four, four, three, three, three, and three cases, respectively. The RQ model was most accurate in one case.

Based upon our findings, we found a similar guideline per grouping strategy in terms of curve fitting (the Gamma-based VDM). However, we found different guidelines in terms of prediction accuracy. In some cases, the model(s) selected for a software from St. 1 were similar to those selected for its subversions from St. 2, but this occurred in only a few cases. For IOS (curve fitting), the AML and Normal-based VDMs were the best models under both strategies. For Chrome (curve fitting) and one of its subversion groups (Chrome 3.x & 4.x), the Gamma-based VDM was the best common model considering both strategies. For Mac (prediction), the Gamma-based VDM was the most accurate model under both strategies. For Linux (prediction), the Power-law model was the best model under both strategies.

6.8 Limitations
There are several limitations to our work that prevent us from making more general conclusions.
Some of these limitations are shared with previous chapters, such as the uncertainties associated with vulnerability databases and the way we collect the vulnerability data reported for any version of a software from different sources and filter them based upon the earliest date that a given vulnerability was known in any of these sources (cf. Section 4.7). In addition, we have only applied the approach to 8 software and their subsets; we do not yet know how well this works for other software and their associated vulnerability datasets (cf. Section 5.9). The next limitation is that we assumed the percentage of common vulnerabilities between two or more consecutive versions of a software to be a good measure of the similarity of their source code (i.e., we grouped the consecutive versions with more than 70% common reported vulnerabilities). There are other factors that could be considered together with this assumption to improve its validity, for example, examining the source code of those versions and counting the number of lines with similar code. However, collecting such features is only feasible for open-source software and cannot be applied to closed-source software.

6.9 Summary
In this chapter, we developed some guidelines for analyzing right-skewed vulnerability data (i.e., those datasets where more vulnerabilities are reported earlier in the product lifecycle) using two vulnerability grouping strategies. We compared the curve-fitting and prediction capabilities of eight different models: one NHPP Power-law model, two right-skewed distribution models, one flexible-skewed distribution model, three symmetric distribution models, and two non-S-shaped VDMs. These models were applied to eighteen datasets that originated from four OSs and four web browsers. The datasets were built using two strategies for grouping vulnerabilities (vulnerabilities merged for all versions, and groups built based on the number of common vulnerabilities across versions). We found that a model's ability to provide a good fit does not necessarily guarantee superior prediction capabilities from the same model.

Chapter 7: Vulnerability Prediction Capability: A Comparison between Vulnerability Discovery Models and Neural Network Models

7.1 Introduction
In this chapter, we introduce an approach for predicting the total number of software vulnerabilities and compare the results with those found from the aforementioned VDMs. Our approach uses a neural network model (NNM) to capture the nonlinearities associated with vulnerability disclosure. Eight common VDMs were used to compare their prediction capability with the NNM.

7.2 Motivation
Vulnerability discovery models (VDMs) were developed to predict future software vulnerabilities based on their historical behavior. Although VDMs are often accurate in terms of curve fitting, they might not perform well in the prediction phase [25]. Indeed, VDMs are often not powerful enough to take the nonlinear nature of vulnerability disclosure times into consideration. In recent years, some software vulnerability disclosure process models were developed using traditional time series models like the Autoregressive Integrated Moving Average (ARIMA) [67]. However, vulnerability disclosure data contain substantial nonlinearity, and thus traditional time series models might not be appropriate [68]. Pokhrel et al. [69] compared the modeling capability of linear and nonlinear time series for three OSs (i.e., Windows 7, Mac OS X, and Linux Kernel).
They developed models based on ARIMA, Artificial Neural Network (ANN), and Support Vector Machine (SVM) settings. In this chapter, we make the following contributions:
- We introduce a nonlinear modeling approach based on neural networks to model the nonlinearities associated with vulnerability disclosure times and predict the total number of software vulnerabilities in 30-day time intervals.
- We compare the prediction capability of the neural network model (NNM) with eight commonly used VDMs. We applied the models to the vulnerability data associated with our aforementioned database consisting of four well-known operating systems (OSs) (Windows, Mac, IOS (the OS associated with Cisco), and Linux), as well as four well-known web browsers (Internet Explorer, Safari, Firefox, and Chrome).
- We show that the NNM outperforms the VDMs in all the cases in terms of prediction accuracy, and provides smaller values of absolute average bias in seven cases.

7.3 Data Processing
We analyze the reported vulnerabilities associated with four well-known OSs: Windows (1995-2017), Mac (1997-2017), IOS (the OS associated with Cisco) (1992-2017), and Linux (1994-2017), as well as four well-known web browsers: Internet Explorer (1997-2017), Safari (2003-2017), Firefox (2003-2017), and Chrome (2008-2017). These software have been selected because they are the most widely used and have the most vulnerabilities in the database. Figures 8 and 9 show the detection frequency of all vulnerabilities associated with each software over 30-day time intervals for the studied OSs and web browsers, respectively. We also plotted the 180-day moving average (MOVAVG) for each software to gain a better understanding of the vulnerability detection trend. As shown, the maximum value of the MOVAVG for all cases occurred after 2015.

As mentioned in Chapter 3, for each software, we analyze all vulnerabilities reported for any of its versions; thus, for each software, all the vulnerabilities reported for any of its versions were included. In addition, we divided the vulnerability dataset associated with each software into two groups: training and testing. The training dataset consists of all the vulnerabilities reported before 2015; the testing dataset consists of the vulnerabilities reported in the years 2015, 2016, and 2017. Table 36 presents the total number of vulnerabilities per software, as well as the number of vulnerabilities in the training and testing phases.

Table 36: NUMBER OF VULNERABILITIES PER SOFTWARE
OS: Windows: Total 3100, Train 2237, Test 863 | Mac: Total 2705, Train 1605, Test 1100 | IOS: Total 650, Train 438, Test 212 | Linux: Total 4745, Train 2609, Test 2136
Web browser: IE: Total 1775, Train 1059, Test 716 | Safari: Total 943, Train 701, Test 242 | Firefox: Total 1477, Train 1150, Test 327 | Chrome: Total 1837, Train 1229, Test 608

Figure 8: Histogram of the number of detected vulnerabilities per 30 days, together with its 180-day moving average, for the studied OSs. The X-axis represents time (year); the Y-axis shows the frequency of discovered vulnerabilities over 30-day time intervals. The blue and red colors show data associated with the training and test datasets.

Figure 9: Histogram of the number of detected vulnerabilities per 30 days, together with its 180-day moving average, for the studied web browsers. The X-axis represents time (year); the Y-axis shows the frequency of discovered vulnerabilities over 30-day time intervals. The blue and red colors show data associated with the training and test datasets.
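A minimal sketch of the MOVAVG computation, assuming the 180-day window is realized as six consecutive 30-day intervals:

```python
import numpy as np

def movavg_180(counts_30d):
    # 180-day moving average = mean over 6 consecutive 30-day counts.
    kernel = np.ones(6) / 6.0
    return np.convolve(np.asarray(counts_30d, dtype=float), kernel,
                       mode="valid")
```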
7.4 Neural Network Model (NNM)
Neural network models (NNMs) consist of a set of algorithms for modeling and recognizing patterns. NNMs have been widely used for predicting sequential time-series data, such as the monthly electricity demand of a city or stock prices [68], [70], [71]. Unlike VDMs, NNMs are capable of capturing the nonlinearity that exists in noisy time-series data. In addition, NNMs are not built upon assumptions regarding the form of the underlying model, since they are completely data-driven; in other words, NNMs are flexible nonlinear data-driven models with strong predictive power. Data-driven models are useful for cases where there is no theoretical guidance to explain the data-generation process. It has been empirically shown that NNMs are capable of predicting both linear and nonlinear time series of different forms [72].

To predict the number of vulnerabilities discovered over time for a given software, we use a feedforward NNM, which is the most widely used type of neural network [68]. Feedforward NNMs accept a fixed number of inputs at a time and generate one output. We assume that the number of future vulnerabilities depends on the number of vulnerabilities disclosed over the past periods (lags). We use a single hidden-layer NNM for one-step-ahead prediction. According to [73], a single hidden-layer NNM is capable of approximating any nonlinear function with arbitrary precision. Figure 10 shows the structure of the NNM used in our study. Our feedforward NNM consists of three layers: input, hidden, and output. Each layer is a collection of neurons (nodes), where the connections are governed by the corresponding weights. Data are fed through the input layer, pass through one or more hidden layers, and the final outcome is provided by the output layer.

Figure 10: The NNM architecture used for our study.

To predict the present value, several past observations are used. In other words, the inputs are a p-element subset of the set {y_{t-p}, ..., y_{t-2}, y_{t-1}}, and y_t is the output, i.e., the total number of vulnerabilities reported in period t. Equations 25 and 26 show the formulas associated with the input and output values of the hidden layer, respectively; for the output layer, the input and output values are represented by Equations 27 and 28:

I_j = \sum_{i=t-p}^{t-1} w_{ji} \, y_i + \beta_j, \qquad j = 1, \dots, h \qquad (25)

y_j = f_h(I_j), \qquad j = 1, \dots, h \qquad (26)

I_o = \sum_{j=1}^{h} w_{oj} \, y_j + \alpha_o, \qquad o = 1 \qquad (27)

y_t = f_o(I_o), \qquad o = 1 \qquad (28)

where I denotes the input; y denotes the output; p and h are the numbers of input and hidden-layer nodes, respectively; w_{ji} represents the connection weights between the input and hidden layers; and w_{oj} denotes the connection weights between the hidden and output layers. The bias values of the hidden and output layers are denoted by β_j and α_o, respectively, and are always between -1 and 1. f_h and f_o are the activation functions associated with the hidden and the output layers, respectively. As the hidden-layer activation function, we used the hyperbolic tangent function, since it is the most widely used [68].
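The forward pass defined by Equations 25-28 can be sketched as follows (in Python rather than Matlab); the weights here are random placeholders, whereas in the study they are learned during training.

```python
import numpy as np

def nnm_forward(lags, W_ih, b_h, w_ho, b_o):
    I_h = W_ih @ lags + b_h   # Eq. 25: hidden-layer inputs
    y_h = np.tanh(I_h)        # Eq. 26: tansig activation
    I_o = w_ho @ y_h + b_o    # Eq. 27: output-layer input
    return I_o                # Eq. 28: purelin (identity) output

# Toy dimensions: p = 4 input lags, h = 8 hidden nodes. The random
# weights stand in for parameters learned by Levenberg-Marquardt training.
rng = np.random.default_rng(0)
p, h = 4, 8
y_next = nnm_forward(rng.normal(size=p), rng.normal(size=(h, p)),
                     rng.normal(size=h), rng.normal(size=h),
                     rng.normal())
```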
The initial step in designing an NNM is to determine the optimal number of input nodes (lags) and hidden-layer nodes. Based on the literature, there is no systematic approach [68]; the most common way of identifying the appropriate number of nodes (input and hidden) is via trial and error, based upon finding the minimum mean square error (MSE) on the test data [74]. We followed this approach and identified the number of hidden nodes experimentally for the time series associated with each software. We evaluated up to 50 hidden nodes for each time series and chose the number of hidden nodes that minimized the MSE. Regarding the optimal number of inputs (lags), we used an optimization algorithm and chose the combination of lags that led to the lowest MSE. We started with the statistically significant lags derived from evaluating the partial autocorrelation function (PACF) associated with each time series. In time series analysis, the PACF gives the linear partial correlation of a time series with its own lagged values [75]. However, we cannot rely only on the lags found from the PACF since, in that case, the selection of inputs would be based merely on the identification of a linear model, while the goal of using an NNM is to capture nonlinear correlations as well. A very good review of existing input selection methods for NNMs is provided in [76].

The NNM was programmed using Matlab R2018a. For each software, we began our analysis by dividing the vulnerability dataset into two groups, training and testing: the training dataset consists of all the vulnerabilities reported before 2015, and the testing dataset consists of the vulnerabilities reported in the years 2015, 2016, and 2017. NNM training is a complex nonlinear optimization problem; thus, there is a possibility of getting trapped in local minima of the error surface. To avoid poor results, the training process should be repeated several times with different random starting weights and biases [72]. We set the maximum training number equal to 500 epochs. An epoch stands for one complete pass of a given dataset through the training process and reflects the number of times the weights in a network are updated [77]. Since model optimization in deep learning algorithms is done using the gradient descent method [78], it makes sense to pass the training dataset through the network multiple times, updating the weights accordingly, to achieve a more accurate model in terms of prediction [77]. We used the Levenberg-Marquardt (LM) method as our learning function. The activation functions of the hidden and output layers are the tansig and purelin functions, respectively. To avoid overfitting/overtraining, for each software, we employed a cross-validation method by dividing our training dataset into three subgroups of training data (70%), validation data (15%), and test data (15%), and checked the validation performance of the trained network via metrics provided by the Matlab Neural Network toolbox, such as the gradient (gradient threshold = 1.00e-4) and the maximum number of validation checks (max_fail = 100). These metrics served as stop conditions of the training phase: whenever the parameters of the network under training met any of these thresholds, the training process was stopped. This process prevents the algorithm from being overtrained and producing overfitted results.
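The PACF-based screening of candidate lags described earlier in this section can be sketched as follows. Since our implementation was in Matlab, this Python version (using statsmodels) is only an illustrative equivalent, with the usual approximate 95% significance band.

```python
import numpy as np
from statsmodels.tsa.stattools import pacf

def candidate_lags(series, max_lag=24):
    """Lags whose PACF value exceeds the approximate 95% significance
    band; these seed the input-lag search."""
    x = np.asarray(series, dtype=float)
    values = pacf(x, nlags=max_lag)      # values[0] is lag 0
    band = 1.96 / np.sqrt(len(x))
    return [k for k in range(1, max_lag + 1) if abs(values[k]) > band]
```

As noted above, these PACF lags only capture linear dependence, so they serve as a starting point rather than the final input set.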
7.5 Analysis

We used the eight VDMs introduced in Chapter 3 and one NNM to model the discovery process of vulnerabilities in four OSs and four web browsers. The VDMs were fitted to the datasets using the non-linear regression method described in [24]. The analysis of the prediction capability started by dividing the data into two groups, training and testing data. Both the VDMs and the NNM use a dataset that includes all vulnerabilities reported for all versions of a given software. The training period starts when the first vulnerability associated with a given software was discovered and continues until 12/31/2014. We calculated the predictions for the years 2015, 2016, and 2017. As shown in Figure 8 and Figure 9, blue and red denote the training and testing datasets, respectively. We split the vulnerability data into intervals of 30 days, as is common in the vulnerability analysis literature [18], [24], [25].

For the VDMs, the training data was used to estimate the model parameters during the training period. The final estimated values for each 30-day interval produced by the eight models were compared with the actual numbers of vulnerabilities to calculate the prediction accuracy. For the NNM, for each software, we used the training data to train the NNM; using the trained NNM, we then predicted the expected values for the next steps. The prediction accuracy is based on the comparison between the obtained estimates and the actual numbers of vulnerabilities.

For the training part, for the VDMs, we applied the Chi-square (χ2) goodness-of-fit test to assess how well each model fits the training datasets. For the NNM, out of the models trained with different numbers of lags, the optimal model was selected based on the MSE value. Finally, for each software, the best selected model was used to make the prediction for the testing dataset (the vulnerabilities reported in 2015, 2016, and 2017). In this chapter, regarding the NNMs, we only report the results associated with the best NNM. For the prediction part, we calculated the two normalized predictability measures, AE and AB. These indicators and their associated equations were introduced in Chapters 4 and 6. In addition, for the VDMs, we report ΔAE_i, which represents the percentage difference between the AE of the i-th model and the model with the minimum AE. A sketch of these measures is given below.
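As a worked illustration of the 30-day binning and the predictability measures, consider the sketch below. The exact AE and AB formulas are those of Chapters 4 and 6 and are not reproduced here; this sketch assumes the natural reading of the normalized error (Ω_t − Ω)/Ω plotted in Figures 11 and 12, with AE the mean absolute normalized error and AB the mean signed normalized error over the prediction steps. All function names are illustrative.

```python
import numpy as np
import pandas as pd

def cumulative_counts(dates, days=30):
    """Cumulative number of reported vulnerabilities per fixed-length interval."""
    s = pd.Series(1, index=pd.to_datetime(sorted(dates)))
    return s.resample(f"{days}D").sum().cumsum()

def prediction_measures(omega_hat, omega):
    """AE and AB over the prediction window (one value per 30-day step),
    assuming e_t = (Omega_hat_t - Omega_t) / Omega_t, AE = mean(|e_t|) (always
    positive) and AB = mean(e_t) (signed), per the Chapter 4/6 definitions."""
    omega_hat = np.asarray(omega_hat, dtype=float)
    omega = np.asarray(omega, dtype=float)
    e = (omega_hat - omega) / omega
    return float(np.mean(np.abs(e))), float(np.mean(e))

def delta_ae(ae_values):
    """Percentage gap of each model's AE to the best (minimum) AE."""
    ae = np.asarray(ae_values, dtype=float)
    return (ae - ae.min()) * 100.0
```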
7.6 Results

Tables 37 and 38 present the values of AE, AB, ΔAE_i, and the p-value (we use * to mark models with p < 0.05) for the cases we analyzed per model (VDMs and NNM). AB can be positive (overestimation) or negative (underestimation), while AE is always positive. In each case, we first found the best VDMs by comparing their prediction accuracy and then compared the accuracy of those models with the NNM. In other words, for the VDMs, the model with the smallest value of AE (i.e., ΔAE_i = 0) was selected as having the best prediction capability. In addition, the VDMs with ΔAE_i < 2 were also selected as best VDMs, since they show prediction capability similar to the best model. The normalized error values ((Ω_t − Ω)/Ω) associated with the OSs and web browsers are plotted in Figure 11 and Figure 12, respectively. As shown there, the models with fewer fluctuations lead to higher accuracy.

Table 37: PREDICTION ACCURACY FOR OSS (VDMS & NNM)

Windows:
Model      AE     AB      %ΔAE_i  p-value
Gamma      0.028  0.002   0.000   0.805
Weibull    0.036  -0.031  0.776   0.870
AML        0.087  -0.087  5.844   0.016*
Normal     0.087  -0.087  5.844   0.016*
Power-law  0.078  0.078   4.982   0.435
RE         0.172  0.172   14.365  0.027*
RQ         0.075  0.075   4.718   0.579
YF         0.059  -0.057  3.067   0.178
NNM        0.015  -0.011  NA      NA

Mac:
Model      AE     AB      %ΔAE_i  p-value
Gamma      0.247  -0.247  20.537  0.817
Weibull    0.281  -0.281  24.006  0.437
AML        0.287  -0.287  24.601  0.809
Normal     0.287  -0.287  24.601  0.809
Power-law  0.062  -0.045  2.066   0.030
RE         0.041  0.041   0.000   0.402
RQ         0.066  -0.049  2.452   0.001*
YF         0.279  -0.279  23.751  0.751
NNM        0.018  0.010   NA      NA

IOS:
Model      AE     AB      %ΔAE_i  p-value
Gamma      0.076  -0.076  4.778   0.901
Weibull    0.071  -0.071  4.195   0.902
AML        0.045  -0.045  1.646   0.890
Normal     0.045  -0.045  1.646   0.890
Power-law  0.070  -0.070  4.173   0.002*
RE         0.059  0.045   3.027   0.001*
RQ         0.199  -0.199  16.988  0.000*
YF         0.029  -0.018  0.000   0.930
NNM        0.021  -0.014  NA      NA

Linux:
Model      AE     AB      %ΔAE_i  p-value
Gamma      0.277  -0.277  9.302   0.828
Weibull    0.272  -0.272  8.791   0.611
AML        0.336  -0.336  15.146  0.000*
Normal     0.336  -0.336  15.146  0.000*
Power-law  0.246  -0.246  6.204   0.342
RE         0.184  -0.184  0.000   0.705
RQ         0.238  -0.238  5.341   0.342
YF         0.307  -0.307  12.229  0.569
NNM        0.040  -0.034  NA      NA

Figure 11: Prediction errors for OSs. The X-axis indicates time (year); the Y-axis represents the normalized prediction error values ((Ω_t − Ω)/Ω).

Table 38: PREDICTION ACCURACY FOR WEB BROWSERS (VDMS & NNM)

IE:
Model      AE     AB      %ΔAE_i  p-value
Gamma      0.264  -0.264  5.887   0.891
Weibull    0.264  -0.264  5.841   0.891
AML        0.319  -0.319  11.357  0.009*
Normal     0.319  -0.319  11.357  0.009*
Power-law  0.264  -0.264  5.820   0.891
RE         0.206  -0.206  0.000   0.309
RQ         0.249  -0.249  4.324   0.590
YF         0.268  -0.268  6.287   0.309
NNM        0.028  0.006   NA      NA

Safari:
Model      AE     AB      %ΔAE_i  p-value
Gamma      0.165  -0.165  9.391   0.049
Weibull    0.216  -0.216  14.474  0.330
AML        0.225  -0.225  15.371  0.330
Normal     0.225  -0.225  15.371  0.330
Power-law  0.072  0.072   0.000   0.409
RE         0.190  0.190   11.801  0.017*
RQ         0.081  0.081   0.976   0.693
YF         0.215  -0.215  14.336  0.978
NNM        0.042  -0.037  NA      NA

Firefox:
Model      AE     AB      %ΔAE_i  p-value
Gamma      0.036  0.032   0.123   0.263
Weibull    0.036  0.032   0.099   0.263
AML        0.088  -0.088  5.304   0.008*
Normal     0.088  -0.088  5.304   0.008*
Power-law  0.056  0.056   2.089   0.263
RE         0.160  0.160   12.452  0.091
RQ         0.084  0.084   4.927   0.159
YF         0.035  -0.034  0.000   0.121
NNM        0.030  -0.016  NA      NA

Chrome:
Model      AE     AB      %ΔAE_i  p-value
Gamma      0.248  -0.248  0.000   0.890
Weibull    0.292  -0.292  4.379   0.897
AML        0.270  -0.270  2.174   0.805
Normal     0.270  -0.270  2.174   0.805
Power-law  0.336  0.336   8.802   0.000*
RE         0.623  0.623   37.481  0.000*
RQ         0.120  0.120   NA      0.000*
YF         0.269  -0.269  2.146   0.950
NNM        0.043  -0.009  NA      NA

Figure 12: Prediction errors for web browsers. The X-axis indicates time (year); the Y-axis represents the normalized prediction error values ((Ω_t − Ω)/Ω).

Based on the results in Tables 37 and 38, in terms of prediction accuracy (AE), the NNM led to the most accurate results for all eight software products we analyzed. To be more precise, for Windows, the NNM's average error (AE) is 1.3% and 2.1% smaller than the AEs of the best VDMs, the Gamma- and Weibull-based VDMs. For Mac, this difference is 2.3%. For IOS, the NNM outperforms the best VDMs (YF, AML, and Normal) with 0.8%, 2.4%, and 2.4% smaller average errors, respectively. Linux and IE are two cases where the NNM provides far better predictions than the VDMs, being 14.4% and 17.8% more accurate, respectively. The average error of the NNM for Safari is 3% and 3.9% smaller than those of the best VDMs (Power-law and RQ). For Firefox, the NNM improved the predictions by 0.5%, 0.6%, and 0.6% compared with the YF, Gamma-, and Weibull-based VDMs, respectively. For Chrome, the VDM with the smallest AE is not statistically sound in the training part.
So, we opted for the next VDM, with p-value > 0.05 and the smallest AE, which is the Gamma-based VDM. In this case, the NNM's accuracy improvement is 20.5%. Overall, the highest differences in prediction accuracy between the NNM and the VDMs were found for Chrome (20.5%), IE (17.8%), Linux (14.4%), and Safari (3.9%).

In terms of magnitude of bias, out of the eight software products we analyzed, the NNM outperformed the VDMs in seven cases by having smaller |AB| values. Only for Windows was the absolute value of bias provided by one of the selected VDMs 0.9% smaller than the one resulting from the NNM. For Mac, IOS, Linux, IE, Safari, Firefox, and Chrome, the bias magnitudes provided by the NNM were smaller than those of the best VDMs (in each case, considering the best VDM with the smallest |AB|) by 3.1%, 0.4%, 15%, 20%, 3.5%, 1.6%, and 23.9%, respectively.

Overall, in terms of accuracy, the NNM outperformed the VDMs in all eight cases we analyzed. In terms of magnitude of bias, the NNM led to the smallest bias values in seven cases. Note that all the conclusions we draw from this research, and their validity, are limited by the uncertainties of our databases.

7.7 Discussion

In terms of prediction accuracy (AE), considering the OSs and web browsers, the NNM led to more accurate results than the best selected VDMs in all cases. The Gamma-based VDM was selected as the best model in three cases out of four; the Weibull and YF VDMs were each best in one case out of four. In terms of overall magnitude of bias (i.e., absolute value of AB), out of the eight cases we analyzed, the NNM provided smaller absolute values of bias in seven cases compared with the best VDMs. Only for Windows was the absolute value of bias provided by the Gamma-based VDM (0.002) smaller than the one resulting from the NNM (-0.011). We believe that the final decision, under equal accuracy conditions, in terms of bias, is up to the researcher, based on his/her priorities. However, from a security point of view, it is better to choose a model that provides more conservative prediction results. In the current study, among the models selected as the best predictors, only two NNMs (Mac and IE) provided overestimated results; the other selected NNMs underestimated the number of vulnerabilities. This can also be inferred from Figure 11 and Figure 12, where for Mac and IE most of the prediction points associated with the NNMs are located above the x-axis (zero error).

We believe that the NNM's better performance compared with the VDMs comes from its capability of capturing the nonlinear nature of the vulnerability disclosure time series. In addition, most VDMs treat the vulnerability discovery process as a pure S-shaped curve, or as a function with a monotonic intensity function and a constant total number of vulnerabilities. However, the number of vulnerabilities associated with a given software may change as newer versions are released. Additionally, VDMs and traditional time-series functions use only one set of parameters for estimation. NNMs, on the other hand, with their multilayer perceptron structure, multiple neurons per layer, and a different set of parameters per neuron, provide a richer structure for prediction. The specific validation method we used to avoid being trapped by overfitting in the learning phase is another advantage of using NNMs.
7.8 Limitations

There are a few limitations to our work that prevent us from generalizing our conclusions. Some of them are shared with previous chapters of this dissertation, such as the uncertainties associated with public repositories of vulnerability data (cf. Section 4.7). Another common limitation concerns using the announced publication date of vulnerabilities as their discovery date: vulnerabilities are usually identified by malicious users before they are officially reported. To ensure that this estimate is as close as possible to the actual date the vulnerability became publicly known, we searched several vulnerability repositories and selected the earliest date announced for each vulnerability (cf. Section 4.7). Another limitation concerns the way we combined all vulnerabilities announced for all versions of a given software in order to have sufficient data for training the models (cf. Section 5.9). While a number of studies apply VDMs to vulnerability data associated with a single version of a software (e.g., Windows 7), other papers consider all versions of a software together [25], [62]. The first group assumes that each version of a given software is an independent and well-defined product, yet identifying the sources of dependence in vulnerability data is not a simple task.

The next limitation is that we only used one machine learning algorithm (a neural network) to show the better performance of such algorithms over the commonly used VDMs. Several other algorithms, such as SVMs, could also be used. However, since our research question was not concerned with comparing the prediction performance of different machine learning algorithms, we only chose one of them.

7.9 Summary

In this chapter, we compared the capabilities of eight common vulnerability discovery models (VDMs) with a nonlinear neural network model (NNM) in terms of predicting the total number of future vulnerabilities over a prediction period of three years. We applied these models to vulnerability data associated with four well-known OSs and four well-known web browsers. The models were assessed in terms of prediction accuracy and prediction bias. The results showed that the NNM outperformed the VDMs in all cases in terms of prediction accuracy. In terms of overall magnitude of bias, out of the eight cases we analyzed, the NNM provided the smallest absolute values of bias in seven cases compared with the best VDMs. This chapter shows that neural networks are promising for accurate predictions of the total number of software vulnerabilities.

Chapter 8: Predicting Exploited Vulnerabilities

8.1 Introduction

Exploited vulnerabilities typically form 2-7% of all the vulnerabilities reported for a given software. With a smaller number of known exploited vulnerabilities compared with the total number of vulnerabilities, it is more difficult to mathematically model and predict when a vulnerability with a known exploit will be reported. In this chapter, we introduce an approach for predicting the total number of publicly-known exploited vulnerabilities using all publicly-known vulnerabilities reported for a given software. Eight commonly used VDMs and one NNM were utilized to evaluate the prediction capability of our approach. We compared their prediction results with the scenario where only exploited vulnerabilities were used for prediction.

8.2 Motivation

For some vulnerabilities, exploits are never published.
This might be because the patches for these vulnerabilities are made available very quickly by the vendors, so that it is not profitable for crackers to develop exploits for them; because the vulnerabilities have a lower criticality from the security viewpoint; or because exploits for these vulnerabilities are only known to the vendors or to security agencies, or are exchanged in, for example, dark-web forums. Whatever the explanation, vulnerabilities with publicly-known exploits usually form only 2-7% of all the vulnerabilities reported for a specific version of a given software [47], [79]. In addition, as opposed to vulnerability databases such as NVD, which are actively maintained, security repositories reporting exploited vulnerabilities, like the Exploit Database (also known as "ExploitDB"), are less common. A comparison between NVD and ExploitDB shows that only 22% of distinct NVD vulnerabilities have exploits reported and listed in ExploitDB.

On the other hand, vulnerabilities with known exploits are more dangerous to end users, even if patches are available, since not all users regularly patch their systems. For this reason, it is important for both vendors and users to be able to predict the time to the next vulnerability with a known exploit. However, with a smaller number of known exploited vulnerabilities compared with the total number of vulnerabilities, it is difficult to model and predict when a vulnerability with a known exploit will be reported. Specifically, this data scarcity makes it difficult to use data-driven models, which are helpful where there is no theoretical guidance to explain the data generation process [80]. Therefore, we postulate that it is a worthwhile research activity to explore whether there is a link between the disclosure times of all vulnerabilities reported for a given software and the discovery times of its exploited vulnerabilities. Finding such a link would allow using the larger dataset of all vulnerabilities for predicting when vulnerabilities with exploits will be reported.

To the best of our knowledge, there is no research focusing on modeling exploited vulnerabilities. The only efforts in this area are probabilistic examinations of intrusions [52], [53]. Lack of data is a major barrier to modeling exploited vulnerabilities using current VDMs or machine learning techniques, which require a considerable amount of data for satisfactory training.

In this chapter, we introduce an approach for predicting the total number of publicly-known exploited vulnerabilities using all vulnerabilities reported for a given software. Eight commonly used VDMs as well as one NNM were used to evaluate the prediction capability of our approach. We applied the models to vulnerability data associated with four well-known OSs (Windows, Mac, IOS (the OS associated with Cisco), and Linux) and four well-known web browsers (Internet Explorer, Safari, Firefox, and Chrome). Our work makes the following contributions:

- We introduce an approach for predicting the total number of publicly-known exploited vulnerabilities using all vulnerabilities reported for a given software;
- We compare the prediction capability of two scenarios, S1 and S2 (S1, where all the vulnerabilities are considered; S2, where only exploited vulnerabilities are), utilizing eight VDMs and one NNM on eight well-known software products;
- We show that, out of the eight software products we analyzed, scenario S1 outperforms scenario S2 in seven cases in terms of prediction accuracy.
Only in one case was the prediction of S1 worse than that of S2, by 1.6%. In other words, for most of the cases analyzed, we show that using all the vulnerability data available for a system allows better prediction of when vulnerabilities with publicly-known exploits will be reported.

8.3 Data Processing

We analyzed the reported vulnerabilities associated with four well-known OSs, Windows (1995-2017), Mac (1997-2017), IOS (the OS associated with Cisco) (1992-2017), and Linux (1994-2017), as well as four well-known web browsers: Internet Explorer (1997-2017), Safari (2003-2017), Firefox (2003-2017), and Chrome (2008-2017). These software products were selected because they are the most widely used and have the most vulnerabilities in the database. For each software, all the vulnerabilities reported for any of its versions were included. For instance, all the vulnerabilities reported for mac_os, mac_os_server, mac_os_x, and mac_os_x_server were put together to create a vulnerability database for Mac.

Two scenarios were considered. In the first scenario (S1), we analyze all vulnerabilities reported for a software for any of its versions. In the second scenario (S2), for each software, we only consider the exploited vulnerabilities. Table 39 presents the total number of vulnerabilities for each software (all vulnerabilities together ("S1") and only exploited vulnerabilities ("S2")). The percentages of exploited/unexploited vulnerabilities per software are presented in Figure 13. Windows and IE had the highest percentages of exploited vulnerabilities, with 24.13% and 22.65%, respectively.

Table 39: NUMBER OF VULNERABILITIES PER SOFTWARE (ALL VS. EXPLOITED)

OS:                          Windows  Mac   IOS  Linux
# All vulnerabilities        3100     2705  650  4745
# Exploited vulnerabilities  748      282   27   481

Web browser:                 IE    Safari  Firefox  Chrome
# All vulnerabilities        1775  943     1477     1837
# Exploited vulnerabilities  402   108     100      78

Figure 13: Percentage of exploited vulnerabilities per software.

8.4 Analytical steps of scenario S1

8.4.1 For the VDMs

In this section, we explain the approach we developed to predict the number of publicly-reported exploited vulnerabilities associated with a given software using all vulnerabilities reported for that software. Regarding the VDMs, we need to find a relationship between the discovery pattern of all vulnerabilities (S1) and that of the vulnerabilities that were exploited (S2). We focus on the ratio of the time to next vulnerability (TTNV) for exploited vulnerabilities over the TTNV associated with all vulnerabilities. We use this ratio as a multiplier in the equations associated with the VDMs in the training phase to approximate the VDMs' equations for exploited vulnerabilities. We use a resampling method and a filtering method to handle the noisy nature of vulnerability data [81], [82]. For each software, we resample/split the vulnerability data (all vulnerabilities and exploited vulnerabilities) into intervals of 120, 150, 180, 210, 240, 270, 300, 330, and 360 days to remove the effect of daily fluctuations. For each interval (the i-th interval), we calculate the mean TTNV of the observations at each time step (MTTNV) and compute the ratio of the MTTNVs:

Ratio_{interval\,i}(t) = MTTNV_{Exploited}(t) / MTTNV_{All}(t).

Figures 14 and 15 show the box plots of the ratios associated with each interval per software. The median of the ratios for each software, Median(Ratio_{interval i}(t)), is almost constant over the different intervals. The median values of the ratios per software are presented in Table 40. A sketch of this ratio computation is given below.
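The sketch below illustrates the resampling and ratio computation described above, under stated assumptions: mttnv and median_ttnv_ratio are hypothetical names, each TTNV gap is assigned to the interval containing its report date, and pandas is used for the date arithmetic.

```python
import numpy as np
import pandas as pd

def mttnv(dates, width_days):
    """Mean time-to-next-vulnerability (MTTNV) per interval of the given width."""
    d = pd.Series(pd.to_datetime(sorted(dates)))
    ttnv = d.diff().dt.days.dropna()                       # gaps between consecutive reports
    interval = ((d - d.min()).dt.days // width_days)[1:]   # interval index of each gap
    return ttnv.groupby(interval.values).mean()

def median_ttnv_ratio(all_dates, exploited_dates, widths=range(120, 361, 30)):
    """Median of Ratio(t) = MTTNV_Exploited(t) / MTTNV_All(t), taken over all
    time steps and all interval widths (120, 150, ..., 360 days)."""
    ratios = []
    for w in widths:
        m_all, m_exp = mttnv(all_dates, w), mttnv(exploited_dates, w)
        common = m_all.index.intersection(m_exp.index)     # time steps present in both
        ratios.extend((m_exp[common] / m_all[common]).tolist())
    return float(np.median(ratios))
```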
The VDM curve for exploited vulnerabilities is then calculated as follows:

\Omega(t)_{Exploited} = \Omega(t)_{All} / Median(Ratio_{interval\,i}(t)) \qquad (5)

Figure 14: Box plots of the TTNV ratio (S2/S1) per interval for each OS.

Figure 15: Box plots of the TTNV ratio (S2/S1) per interval for each web browser.

Table 40: MEDIAN TTNV RATIOS PER SOFTWARE

OS:                    Windows  Mac    IOS    Linux
TTNV ratio (median)    3.526    9.393  5.696  5.306

Web browser:           IE     Safari  Firefox  Chrome
TTNV ratio (median)    3.360  5.792   10.917   50.127

8.4.2 For the NNM

Regarding the NNM, since we want to link two time series, we feed one time series (all vulnerabilities) into the NNM as input and take the output (y_t) from the second time series (exploited vulnerabilities). In other words, the vector of inputs {y_{t-p}, ..., y_{t-2}, y_{t-1}} belongs to S1 and the output is chosen from S2. Details on how the NNM was developed are given in Chapter 7. Both steps are sketched below.
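The two steps of scenario S1, scaling a fitted VDM curve by the median TTNV ratio (Eq. 5) and building the cross-series training pairs for the NNM, can be sketched as follows. exploited_curve and cross_series_pairs are illustrative names, and both series are assumed to be aligned per 30-day interval.

```python
import numpy as np

def exploited_curve(omega_all, median_ratio):
    """Eq. (5): approximate the exploited-vulnerability curve by scaling the
    fitted all-vulnerabilities VDM curve with the median TTNV ratio."""
    return np.asarray(omega_all, dtype=float) / median_ratio

def cross_series_pairs(all_counts, exploited_counts, lags):
    """Training pairs for the cross-series NNM: the lagged inputs come from
    the S1 series (all vulnerabilities), the target y_t from the S2 series."""
    all_counts = np.asarray(all_counts, dtype=float)
    p = max(lags)
    X = np.array([[all_counts[t - k] for k in lags]
                  for t in range(p, len(all_counts))])
    y = np.asarray(exploited_counts, dtype=float)[p:]
    return X, y
```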
8.5 Analysis

For both scenarios (S1 and S2), we used the eight VDMs for the discovery process of vulnerabilities in eight well-known software products (four OSs and four web browsers). The VDMs were fitted to the datasets using the non-linear regression method described in [24]. In addition, for the first scenario (S1), we also used one NNM, which is capable of modeling nonlinearities. Since the NNM is a data-driven model, we could not use it for scenario S2 due to the lack of exploited vulnerabilities.

We started the analysis by splitting the data into two groups, training and testing data. For scenario S1, both the VDMs and the NNM use a dataset that includes all vulnerabilities reported for all versions of a given software. For scenario S2, the VDMs use the data associated with the exploited vulnerabilities reported for those versions. The training period for both scenarios starts when the first exploited vulnerability associated with a given software was reported and continues until 12/31/2014. We made the predictions for the years 2015, 2016, and 2017. We then split the vulnerability data into intervals of 30 days, as is common in the vulnerability analysis literature [18], [24], [25].

For scenario S1, for the VDMs, the training data was used to estimate the model parameters during the training period. Using the estimated parameters and the TTNV ratios found in Section 8.4.1, we estimated the number of exploited vulnerabilities. The estimates for each 30-day interval produced by the eight models were then compared with the actual number of exploited vulnerabilities to calculate the prediction accuracy. For the NNM, for each software, we used the training data to train the NNM; the process amounts to feeding the NNM one time series and comparing the outputs with the values of another time series. Using the trained NNM, we predicted the number of exploited vulnerabilities for the next steps and calculated the prediction accuracy by comparing the obtained estimates with the actual number of exploited vulnerabilities. For scenario S2, for the VDMs, the training data was used to estimate the model parameters, and the final estimated values for each 30-day interval produced by the eight models were compared with the actual number of exploited vulnerabilities to calculate the prediction accuracy. The Chi-square (χ2) goodness-of-fit test [24] was used to evaluate the quality of fit of each model on the training datasets. For the training part, for the NNM, we used the MSE value to select the optimal model out of the models trained with different combinations of lags. Then, for each software, the best model was used to make the prediction for the testing dataset (the vulnerabilities reported in 2015, 2016, and 2017). In this chapter, regarding the NNMs, we report the results associated with the best NNM.

For the prediction part, we calculated the two normalized predictability measures, AE and AB. These indicators and their associated equations were introduced in Chapters 4 and 6. For the VDMs associated with each scenario, we also report ΔVAE_i^k, which gives the difference between the AE of the i-th VDM and the VDM with the minimum AE in the scenario, used to choose the best VDM(s) for each scenario:

\%\Delta VAE_i^k = (VAE_i^k - VAE_{min}^k) \times 100, \qquad (29)

where k denotes the k-th scenario, VAE_i^k is the AE of the i-th VDM, and VAE_{min}^k is the lowest AE found in the set of VDMs examined in the scenario (i.e., the best model). Thus, the ΔVAE_i^k of the best VDM in a scenario is 0. To highlight the difference between the AE of a model and the overall best model in both scenarios, we report ΔAE_j^G, defined as follows:

\%\Delta AE_j^G = (AE_j - AE_{min}^G) \times 100, \qquad (30)

where AE_j is the AE of the j-th model, and AE_{min}^G is the lowest AE found in the set of all models examined (i.e., the best overall model). Thus, the ΔAE_j^G of the best overall model is 0; if ΔAE_j^G = 1.2 for a given model, it means that the model has a 1.2% higher prediction error than the best overall model. A sketch of this selection procedure follows.
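A small sketch of Equations 29 and 30, including the exclusion of models whose training fit is rejected (marked NS in Tables 41 and 42), could look as follows; scenario_deltas and global_deltas are illustrative names.

```python
import numpy as np

def scenario_deltas(ae, p_values, alpha=0.05):
    """Eq. (29): %DeltaVAE_i^k within one scenario. Models with a training
    p-value below alpha are excluded (reported as NS in Tables 41-42)."""
    ae = np.asarray(ae, dtype=float)
    ok = np.asarray(p_values, dtype=float) >= alpha
    vae_min = ae[ok].min()                                 # best admissible VDM
    return np.where(ok, (ae - vae_min) * 100.0, np.nan)    # NaN stands for NS

def global_deltas(ae_s1, ae_s2):
    """Eq. (30): %DeltaAE_j^G against the best model across both scenarios."""
    all_ae = np.concatenate([np.asarray(ae_s1, dtype=float),
                             np.asarray(ae_s2, dtype=float)])
    return (all_ae - np.nanmin(all_ae)) * 100.0
```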
8.6 Results

Tables 41 and 42 present the values of AE, AB, ΔVAE_i^k, and ΔAE_j^G for the cases we analyzed, per scenario and per model (VDMs and NNM). Regarding the p-values, we use * to mark the models with p < 0.05. AB can be positive (overestimation) or negative (underestimation), while AE is always positive. In each case, we first found the best VDM(s) per scenario by comparing their prediction accuracy and then compared the accuracy of those models with the NNM results. In other words, for each software, for the VDMs associated with each scenario, the model with the smallest value of AE was selected as the best VDM in terms of prediction (the VDM with ΔVAE_i^k = 0). In addition, the VDMs with ΔVAE_i^k < 2 were also selected as best VDMs, which, we assume, show prediction capability similar to the best VDM. For each software, the best overall model in both scenarios is the model with ΔAE_j^G = 0. For each software, the normalized error values ((Ω_t − Ω)/Ω) over the prediction time are plotted in Figure 16 and Figure 17. The models with fewer fluctuations lead to higher accuracy.

Table 41: PREDICTION ACCURACY FOR OSS PER SCENARIO (VDMS & NNM)
(%ΔVAE_i is computed within each scenario, i.e., k = 1 for the S1 block and k = 2 for the S2 block.)

Windows     S1: AE   AB      %ΔVAE_i  %ΔAE_j^G  |  S2: AE  AB      %ΔVAE_i  %ΔAE_j^G
Gamma       0.187    0.187   8.126    12.111    |  0.101   -0.097  1.703    3.539
Weibull     0.148    0.148   4.245    8.230     |  0.139   -0.139  5.432    7.268
AML         0.106    0.083   0.000    3.984     |  0.145   -0.145  6.021    7.856
Normal      0.106    0.083   0.000    3.984     |  0.145   -0.145  6.021    7.856
Power-law   0.277    0.277   17.076   21.061    |  0.084   0.084   0.000    1.836
RE          0.387    0.387   28.122   32.106    |  0.138*  0.138   NS       NS
RQ          0.274    0.274   16.766   20.750    |  0.113   0.113   2.892    4.727
YF          0.122    0.118   1.576    5.560     |  0.140   -0.140  5.579    7.415
NNM         0.066    -0.024  NA       0.000     |  NA      NA      NA       NA

Mac         S1: AE   AB      %ΔVAE_i  %ΔAE_j^G  |  S2: AE  AB      %ΔVAE_i  %ΔAE_j^G
Gamma       0.215    -0.215  14.203   14.985    |  0.261   -0.261  0.000    19.599
Weibull     0.251    -0.251  17.772   18.553    |  0.282   -0.282  2.038    21.637
AML         0.257    -0.257  18.399   19.180    |  0.277   -0.277  1.577    21.177
Normal      0.257    -0.257  18.399   19.180    |  0.277   -0.277  1.577    21.177
Power-law   0.073    -0.008  0.000    0.781     |  0.101*  -0.008  NS       NS
RE          0.081*   0.081   NS       NS        |  0.092*  0.005   NS       NS
RQ          0.077    -0.011  0.341    1.122     |  0.094*  0.017   NS       NS
YF          0.248    -0.248  17.513   18.295    |  0.280   -0.280  1.834    21.433
NNM         0.065    0.026   NA       0.000     |  NA      NA      NA       NA

IOS         S1: AE   AB      %ΔVAE_i  %ΔAE_j^G  |  S2: AE  AB      %ΔVAE_i  %ΔAE_j^G
Gamma       0.149    0.146   10.441   13.206    |  0.032   0.018   0.445    1.524
Weibull     0.156    0.154   11.126   13.891    |  0.029   0.001   0.182    1.260
AML         0.185    0.185   14.066   16.831    |  0.028   -0.014  0.000    1.079
Normal      0.185    0.185   14.066   16.831    |  0.028   -0.014  0.000    1.079
Power-law   0.156    0.154   11.153   13.918    |  0.240*  0.240   NS       NS
RE          0.301    0.301   25.647   28.412    |  0.279*  0.279   NS       NS
RQ          0.044    -0.007  0.000    2.765     |  0.153*  0.153   NS       NS
YF          0.220    0.220   17.555   20.321    |  0.028   -0.014  0.080    1.158
NNM         0.017    -0.002  NA       0.000     |  NA      NA      NA       NA

Linux       S1: AE   AB      %ΔVAE_i  %ΔAE_j^G  |  S2: AE  AB      %ΔVAE_i  %ΔAE_j^G
Gamma       0.132    0.132   8.919    11.140    |  0.116   -0.116  7.620    9.542
Weibull     0.140    0.140   9.732    11.954    |  0.128   -0.128  8.874    10.796
AML         0.043    0.037   0.000    2.221     |  0.168   -0.168  12.811   14.733
Normal      0.043    0.037   0.000    2.221     |  0.168   -0.168  12.811   14.733
Power-law   0.182    0.182   13.914   16.135    |  0.040   0.040   0.000    1.922
RE          0.282    0.282   23.969   26.190    |  0.045   0.045   0.540    2.462
RQ          0.196    0.196   15.291   17.513    |  0.050   0.050   1.031    2.953
YF          0.084    0.084   4.135    6.356     |  0.158   -0.158  11.843   13.765
NNM         0.020    0.019   NA       0.000     |  NA      NA      NA       NA

Figure 16: Prediction errors for OSs per scenario (Windows, Mac, IOS, Linux; panels for S1 and S2). The X-axis indicates time (year); the Y-axis represents the normalized prediction error values ((Ω_t − Ω)/Ω).
Table 42: PREDICTION ACCURACY FOR WEB BROWSERS PER SCENARIO (VDMS & NNM)
(%ΔVAE_i is computed within each scenario, i.e., k = 1 for the S1 block and k = 2 for the S2 block.)

IE          S1: AE   AB      %ΔVAE_i  %ΔAE_j^G  |  S2: AE  AB      %ΔVAE_i  %ΔAE_j^G
Gamma       0.121    -0.121  7.189    9.018     |  0.175   -0.175  11.004   14.475
Weibull     0.120    -0.120  7.138    8.967     |  0.182   -0.182  11.700   15.170
AML         0.188    -0.188  13.922   15.752    |  0.256   -0.256  19.048   22.519
Normal      0.188    -0.188  13.922   15.752    |  0.256   -0.256  19.048   22.519
Power-law   0.120    -0.120  7.113    8.942     |  0.087   -0.087  2.172    5.643
RE          0.049    -0.049  0.000    1.829     |  0.065   -0.065  0.000    3.471
RQ          0.102    -0.102  5.292    7.121     |  0.069   -0.069  0.386    3.856
YF          0.126    -0.126  7.717    9.547     |  0.228   -0.228  16.253   19.724
NNM         0.030    0.010   NA       0.000     |  NA      NA      NA       NA

Safari      S1: AE   AB      %ΔVAE_i  %ΔAE_j^G  |  S2: AE  AB      %ΔVAE_i  %ΔAE_j^G
Gamma       0.498    -0.498  20.921   41.101    |  0.140   -0.076  1.386    5.308
Weibull     0.528    -0.528  23.943   44.123    |  0.126   -0.106  0.000    3.923
AML         0.533    -0.533  24.492   44.672    |  0.131   -0.095  0.500    4.423
Normal      0.533    -0.533  24.492   44.672    |  0.131   -0.095  0.500    4.423
Power-law   0.357    -0.357  6.898    27.078    |  0.285   0.224   15.963   19.886
RE          0.288    -0.288  0.000    20.181    |  0.265   0.193   13.924   17.847
RQ          0.351    -0.351  6.312    26.493    |  0.270   0.202   14.462   18.384
YF          0.527    -0.527  23.863   44.043    |  0.127   -0.104  0.101    4.024
NNM         0.087    0.042   NA       0.000     |  NA      NA      NA       NA

Firefox     S1: AE   AB      %ΔVAE_i  %ΔAE_j^G  |  S2: AE  AB      %ΔVAE_i  %ΔAE_j^G
Gamma       0.324    0.324   15.768   31.419    |  0.010   0.005   0.000    0.000
Weibull     0.324    0.324   15.743   31.394    |  0.029   -0.029  1.912    1.912
AML         0.167    0.167   0.000    15.651    |  0.064   -0.064  5.344    5.344
Normal      0.167    0.167   0.000    15.651    |  0.064   -0.064  5.344    5.344
Power-law   0.355    0.355   18.888   34.539    |  0.209*  0.209   NS       NS
RE          0.492    0.492   32.510   48.162    |  0.199*  0.199   NS       NS
RQ          0.393    0.393   22.610   38.261    |  0.170*  0.170   NS       NS
YF          0.238    0.238   7.097    22.748    |  0.064   -0.064  5.370    5.370
NNM         0.026    -0.024  NA       1.594     |  NA      NA      NA       NA

Chrome      S1: AE   AB      %ΔVAE_i  %ΔAE_j^G  |  S2: AE  AB      %ΔVAE_i  %ΔAE_j^G
Gamma       0.531    -0.531  0.000    35.283    |  0.363   -0.363  9.162    18.482
Weibull     0.557    -0.557  2.635    37.918    |  0.359   -0.359  8.724    18.045
AML         0.544    -0.544  1.324    36.606    |  0.409   -0.409  13.722   23.043
Normal      0.544    -0.544  1.324    36.606    |  0.409   -0.409  13.722   23.043
Power-law   0.210*   -0.204  NS       NS        |  0.285   -0.231  1.404    10.725
RE          0.108*   -0.052  NS       NS        |  0.271   -0.148  0.000    9.321
RQ          0.325*   -0.325  NS       NS        |  0.330   -0.328  5.831    15.152
YF          0.544    -0.544  1.275    36.557    |  0.473*  -0.473  NS       NS
NNM         0.178    -0.157  NA       0.000     |  NA      NA      NA       NA

Figure 17: Prediction errors for web browsers per scenario (IE, Safari, Firefox, Chrome; panels for S1 and S2). The X-axis indicates time (year); the Y-axis represents the normalized prediction error values ((Ω_t − Ω)/Ω).

Based on the results in Tables 41 and 42, in terms of prediction accuracy (AE), out of the eight software products we analyzed, scenario S1 led to the most accurate results in seven cases. Only for Firefox was the best VDM from scenario S2 more accurate than the best model of scenario S1, which is the NNM. In addition, considering both scenarios, the NNM was selected as the best prediction model in seven cases. As mentioned before, the VDMs marked with the * superscript had a p-value less than 0.05 and are not considered in our analysis; in Tables 41 and 42 we use the term "NS" (Not Satisfactory) for these models. For Windows, the best model from scenario S1, which is the NNM (ΔAE_j^G = 0), is 1.8% more accurate than the best one from scenario S2 (the model with the smallest AE in scenario S2, with ΔAE_j^G = 1.836). For Mac, the best model is also the NNM, with a 19.59% smaller average prediction error (AE) than the best model from scenario S2.
For IOS, Linux, IE, Safari, and Chrome, the situation is similar to Windows and Mac, with the NNM (from S1) being the best model, yielding 1.1%, 1.9%, 3.5%, 3.9%, and 9.3% smaller prediction errors than the best models from scenario S2, respectively. For Firefox, the model with the smallest AE (ΔAE_j^G = 0) belongs to scenario S2, having a 1.6% smaller AE than the best model from scenario S1, which is the NNM (ΔAE_j^G ≈ 1.6). Overall, scenario S1 provides more accurate results for the number of future exploited vulnerabilities in seven cases out of eight. In the only case where the best model from scenario S2 provided the most accurate predictions, the performance of the best model from scenario S1 was only 1.6% worse.

To evaluate the performance of our approach among the VDMs, we also need to compare the results obtained only from VDMs. Considering only VDMs, in terms of prediction accuracy (AE), out of the eight software products we analyzed, scenario S1 led to the most accurate results in only two cases: for Mac and IE, the best VDM from S1 had higher accuracy than the best VDM from scenario S2, with 18.8% and 1.6% smaller prediction errors, respectively. However, the best VDMs from scenario S1 were less than 2.2% apart in prediction error from the best VDM from scenario S2 in three further cases; the error differences for Windows, IOS, and Linux are 2.2%, 1.6%, and 0.3%, respectively. Only for Safari, Firefox, and Chrome is this difference high, with the best VDM from scenario S2 outperforming the best VDM from scenario S1 by 16.2%, 15.7%, and 26% smaller prediction errors, respectively. Overall, comparing only VDMs, scenario S1 was able to perform better than, or as well as, scenario S2 (with less than 2.2% error difference) in five cases.

Another important factor in model selection is the tendency of a given model to overestimate or underestimate the results. In this research, we provided the average bias values (AB) as well as the visual fluctuation trend of the normalized prediction errors (Figure 16 and Figure 17). For each software, we now compare the best overall model and the models of similar prediction power (those with ΔAE_j^G ≤ 2) in terms of average bias. For a given software, if there are multiple models that satisfy this condition, we consider the model with the lowest AB. Five software products qualify for this comparison: Mac, IOS, Linux, IE, and Firefox. For Linux, IE, and Firefox, the absolute value of AB for the best overall model was smaller than that of the other candidates with ΔAE_j^G ≤ 2 by 2.1%, 3.9%, and 1.9%, respectively. For Mac and IOS, the best overall model has a higher absolute bias, by 1.8% and 0.1%, respectively. However, in terms of bias, the final decision depends on the researcher's priorities for selecting the best model.

8.6.1 Summary of Results

In terms of prediction accuracy (AE), considering the OSs and web browsers (eight cases), our approach led to more accurate results in seven cases. In all of those cases, the NNM provided the best model. Comparing only VDMs, in terms of prediction accuracy, scenario S1 was able to perform better than, or as well as, scenario S2 (with less than 2.2% error difference) in five cases. Note that all the conclusions we draw from this research, and their validity, are limited by the uncertainties of our databases.
8.7 Discussion

We believe that the NNM's better performance compared with the VDMs originates from its capability of capturing the nonlinear nature of the vulnerability discovery process as a time series. Moreover, a common assumption in most VDMs is a pure S-shaped curve for the vulnerability discovery process, or a discovery function with a monotonic disclosure rate and a constant total number of vulnerabilities. In practice, however, the vulnerability discovery process of a given software may go through several linear and saturation phases, as the total number of vulnerabilities may change when newer software versions are introduced. Furthermore, VDMs and traditional time-series functions have only one set of parameters for estimation. NNMs, with their multilayer perceptron structure, multiple neurons per layer, and a different set of parameters per neuron, yield a structure of higher complexity for prediction.

In terms of overall magnitude of bias (i.e., absolute value of AB), out of the seven cases in which scenario S1 performed better, the best model from scenario S1 outperformed the best VDMs from scenario S2 (those with ΔAE_j^G ≤ 2) in five cases. We believe that, under equal accuracy conditions, the final decision in terms of bias is up to the researcher, based on his/her priorities. However, from a security perspective, it is better to choose a model that gives more conservative results. In the current study, out of the seven NNMs chosen as the best models, the AB value is negative in three cases (Windows, IOS, and Chrome); in other words, in these cases the predictor underestimated the results. This can also be inferred from Figure 16 and Figure 17, where for Windows, IOS, and Chrome most of the prediction points associated with the NNMs are located below the x-axis (zero error). For the rest of the cases, the best overall model produced positive ABs, i.e., conservative results.

8.8 Limitations

There are a few limitations to our work that prevent us from generalizing our conclusions. As in previous chapters, some limitations are shared. One of them is using the announced publication date of vulnerabilities as their discovery date (cf. Section 4.7). Another is associated with the uncertainties of vulnerability databases and the way in which we combined all vulnerabilities reported for all versions of a given software to have sufficient data for training the models. However, this limitation does not apply only to our research and has been accepted by other researchers as well (cf. Section 4.7).

Another limitation concerns the availability of public information on exploits. Many vendors and public repositories, with good reason, may not publish information on exploits, as that is likely to increase the security risks for the end users of those systems. Responsible hackers are also more likely not to publish their exploits in public fora, as they can report them to the vendors directly, while malicious hackers are more likely to attempt to monetize their discoveries via dark-web fora. Hence, the predictions we make of publicly known exploits are likely to be underestimates of the true number of all vulnerabilities with exploits.
Nevertheless, the approach we describe in this dissertation can be used by vendors and organizations that have more information about exploits, which they cannot share publicly, to calibrate their predictions.

8.9 Summary

In this chapter, we evaluated the capability of all vulnerabilities associated with a software for predicting the number of exploited ones. We compared two scenarios: S1 (use of all vulnerabilities) and S2 (use of only exploited vulnerabilities). We used eight common vulnerability discovery models (VDMs) for both scenarios, as well as a non-linear neural network model (NNM) for the first scenario; due to the insufficient number of exploited vulnerabilities, it was not feasible to use the NNM for the second scenario. We used these models for predicting the total number of future vulnerabilities over a prediction period of three years, applying them to vulnerability data associated with four well-known OSs and four well-known web browsers. We evaluated the models in terms of prediction accuracy and prediction bias. The results showed that, out of the eight software products we analyzed, the first scenario led to more accurate results in seven cases; moreover, in all of these seven cases, the NNM was chosen as the best model. Comparing only VDMs, in terms of prediction accuracy, the first scenario was able to acceptably approximate the results of the second scenario in five cases (performing better in two cases and providing less than 2.2% error difference in three cases). This is encouraging since we do not always have access to exploited vulnerability data, which are scarce, and may need to predict their report times based on other publicly accessible information.

Chapter 9: Proposed Future Work and Summary of Completed Work

9.1 Introduction

Organizations face the issue of how to best allocate their security resources. Thus, they need an accurate method for assessing how many new vulnerabilities will be reported for the software they use in a given time period. Vulnerability reports of software systems are widely used by security researchers for security analysis (e.g., software reliability analysis). Researchers have used data from various vulnerability databases to study trends in the discovery of new vulnerabilities and have proposed various models for fitting the discovery times and predicting when new vulnerabilities may be discovered for a given product. Estimating the discovery times of new vulnerabilities is useful both for the vendors of these products and for the end-users, as it can help them with their resource allocation strategies over time. This chapter concludes this dissertation with a summary of the research questions and contributions (9.2), a summary of published work (9.3), and areas of future work (9.4).

9.2 Summary of the Research Questions and Contributions

9.2.1 Summary of dissertation and research questions

Among the research conducted on vulnerability modeling, few studies have tried to provide a guideline about which model should be used in a given situation. In addition, to the best of our knowledge, there is no research focusing on modeling exploited vulnerabilities; the only efforts in this area are probabilistic examinations of intrusions [52], [53]. Lack of data is a major barrier to modeling exploited vulnerabilities using current VDMs or machine learning techniques, which require a considerable amount of data for satisfactory training.
In other words, assuming the vulnerability data for a software is given, the research questions are the following:

RQ1: What models are more accurate for vulnerability discovery process modeling? Should all models be applied every time a new dataset is provided?

RQ2: Is there any feature in the vulnerability data that could be used for identifying the most appropriate models for that dataset?

RQ3: Can we predict the disclosure of exploited vulnerabilities with only a small number of data points? Is there any way machine learning algorithms could be applied to predict exploited vulnerabilities?

RQ4: Is there any link between the discovery pattern of all vulnerabilities associated with a given software and that of its exploited vulnerabilities?

9.2.2 Summary of contributions

The goal of this dissertation is to propose a guideline to characterize the vulnerability discovery process using several common software reliability/vulnerability discovery models, also known as Software Reliability Models (SRMs)/Vulnerability Discovery Models (VDMs), as well as a commonly used machine learning technique, neural networks. The proposed guideline covers different aspects of vulnerability modeling, including curve fitting and prediction. The data used in this research were collected from six different public repositories of vulnerability data: the National Vulnerability Database (NVD) (https://nvd.nist.gov), the Common Vulnerabilities and Exposures (CVE) database (https://cve.mitre.org), the CVE Details data source (https://cvedetails.com/), the Security database (https://www.security-database.com/), the SecurityFocus data source (http://www.securityfocus.com/), and the CXSecurity database (https://cxsecurity.com/).

The contributions of this dissertation are as follows:

- A new guideline to characterize the vulnerability discovery process has been presented using eight common SRMs/VDMs. The proposed guideline covers vulnerability discovery modeling, curve fitting, and prediction.
- Two strategies to improve curve fitting and prediction accuracy have been considered.
- The effect of employing a data manipulation technique (i.e., clustering) on improving the curve fitting and prediction capabilities of the current SRMs/VDMs has been investigated.
- The analysis has been applied to two groups of software: operating systems (OSs) and web browsers. More specifically, we selected four OSs (Windows, Mac, IOS, and Linux) and four web browsers (Internet Explorer, Safari, Firefox, and Chrome).
- We discussed the effect of another data manipulation technique (vulnerability grouping) on the prediction capabilities of the VDMs.
- Our proposed guideline was expanded by applying the mentioned models to only the vulnerabilities in NVD that have been exploited, since it is necessary for vendors to identify the vulnerabilities at risk of being exploited and to find those with the potential of rapidly having an exploit.
- We presented a new modeling approach using neural networks and evaluated its prediction capability against the VDMs. The proposed approach provides higher curve fitting and prediction capabilities than the current SRMs/VDMs.
- A new approach was introduced for modeling and predicting the total number of publicly-known exploited vulnerabilities using all publicly-known vulnerabilities reported for a given software. The proposed approach has higher curve fitting and prediction capabilities than the current SRMs/VDMs.
This dissertation contributes to the science of software reliability analysis and presents a new guideline for vulnerability risk assessment that could be integrated as part of security tools, such as Security Information and Event Management (SIEM) systems.

9.3 Summary of Published Work

9.3.1 Published work

A conference paper related to the research presented in this manuscript has been published. The paper, titled Cluster-based Vulnerability Assessment Applied on Operating Systems, was presented at the Dependable Computing Conference (EDCC) in September 2017 [58]. It investigated how a cluster-based approach can improve the prediction capability of the NHPP Power-law model in vulnerability discovery modeling of operating systems. This paper was selected as the distinguished paper of the conference.

One journal paper related to the research presented in Chapter 5 of this dissertation was published in September 2018. The paper, titled Cluster-based Vulnerability Assessment of Operating Systems and Web Browsers, published in the Computing journal, is an extended version of [83].

A conference paper not included in this dissertation research was presented at the Dependable Computing Conference (EDCC) in September 2017. The paper, titled Application of Routine Activity Theory to Cyber Intrusion Location and Time [84], explored the applicability of criminological theories to cybercriminals in an attempt to learn more about attacker behavior. The daily patterns of attack attempts on a network, recorded over a period of four years, were examined to identify periods of higher intrusion volume.

9.3.2 Additional completed work

A conference paper, titled An Empirical Comparison of Grouping Strategies of Vulnerabilities for Modeling Vulnerability Discovery Processes, will be submitted in April 2019 to the Dependable Computing Conference (EDCC) 2019. This paper corresponds to Chapter 6 of this dissertation.

One journal paper, titled Vulnerability Prediction Capability: A Comparison between Vulnerability Discovery Models and Neural Network Models, will be submitted in Summer 2019 to the Computers & Security journal. This paper corresponds to Chapter 7 of this dissertation.

One journal paper related to the research presented in Chapter 8 of this dissertation will be submitted in Summer 2019. The paper, titled Predicting Exploited Vulnerabilities, to be submitted to IEEE Transactions on Dependable and Secure Computing, corresponds to Chapter 8 of this dissertation.

9.4 Future Work

It is necessary for vendors to identify vulnerabilities and their detection patterns, specifically the vulnerabilities at risk of being exploited, and to find those with the potential of rapidly having an exploit. We showed that we can improve the accuracy of predicting future vulnerabilities, whether exploited or not. Future work could focus on exploring other nonlinear model structures using machine learning algorithms capable of capturing the nonlinear nature of vulnerability data. Among them are Recurrent Neural Network (RNN) models, used for time-series prediction, which may perform better than NNMs at modeling dependencies between two points in a sequence. Generally, in NNMs, the length of the input (the number of inputs) has to be chosen beforehand, so it is not possible to learn functions that depend on inputs from arbitrarily far in the past. This problem could be solved by an RNN, which can, in theory, store information from arbitrarily long ago.
Appendices

Appendix A: Clustering Tables

Operating Systems

Table 43: NUMBER OF VULNERABILITIES PER OS

OS:                           Windows       Mac           IOS          Linux
# Vulnerabilities             3434          2908          698          5812
# Labelled vulnerabilities    2974 (86.6%)  2513 (86.4%)  643 (92.1%)  4533 (78%)
# Non-labelled vulnerabilities 460 (13.4%)  395 (13.6%)   55 (7.9%)    1279 (22%)

Table 44: NUMBER OF VULNERABILITIES PER TYPE AND OS

Keywords                       Windows        Mac            IOS           Linux
Denial of Service              900 (30.26%)   1214 (48.31%)  517 (80.40%)  2442 (53.87%)
Execute Code                   1244 (41.83%)  1445 (57.50%)  71 (11.04%)   969 (21.38%)
Overflow                       710 (23.87%)   1041 (41.43%)  64 (9.95%)    1211 (26.72%)
SQL Injection                  7 (0.23%)      3 (0.11%)      0             13 (0.29%)
Obtain Information             387 (13.01%)   325 (12.93%)   27 (4.20%)    601 (13.26%)
Gain Privileges                579 (19.47%)   186 (7.40%)    14 (2.18%)    431 (9.51%)
Bypass Restriction or Similar  248 (8.34%)    258 (10.27%)   49 (7.62%)    340 (7.50%)
Directory Traversal            33 (1.11%)     19 (0.76%)     2 (0.31%)     56 (1.23%)
Cross Site Scripting           70 (2.35%)     58 (2.31%)     11 (1.71%)    86 (1.90%)
Http Response Splitting        0              2 (0.08%)      0             6 (0.13%)
CSRF                           3 (0.10%)      3 (0.12%)      3 (0.47%)     13 (0.29%)
Memory Corruption              362 (12.17%)   758 (30.16%)   8 (1.24%)     278 (6.13%)

Table 45: NUMBER OF VULNERABILITIES PER TYPE AND CLUSTER (WINDOWS)

Denial of Service: 1 (0.2%), 253 (88.8%), 537 (100%), 68 (16.3%), 41 (7.1%), 0
Execute Code: 2 (0.4%), 262 (91.9%), 400 (95.7%), 580 (100%), 0, 0
Overflow: 62 (8.9%), 1 (0.2%), 170 (59.6%), 59 (11%), 418 (100%), 0
SQL Injection: 2 (0.3%), 5 (0.9%), 0, 0, 0, 0
Obtain Information: 376 (81.9%), 1 (0.3%), 3 (0.6%), 7 (1.2%), 0, 0
Gain Privileges: 517 (74.4%), 10 (2.2%), 11 (3.9%), 20 (3.7%), 2 (0.5%), 19 (3.3%)
Bypass Restriction or Similar: 183 (26.3%), 45 (9.8%), 1 (0.3%), 3 (0.6%), 1 (0.2%), 15 (2.6%)
Directory Traversal: 1 (0.1%), 23 (5%), 1 (0.2%), 8 (1.4%), 0, 0
Cross Site Scripting: 3 (0.4%), 61 (13.3%), 6 (1%), 0, 0, 0
Http Response Splitting: 0, 0, 0, 0, 0, 0
CSRF: 2 (0.3%), 1 (0.2%), 0, 0, 0, 0
Memory Corruption: 15 (2.2%), 1 (0.2%), 285 (100%), 11 (2%), 50 (8.6%), 0
# Vulnerabilities (clusters 1-6): 695, 459, 285, 537, 418, 580

Table 46: NUMBER OF VULNERABILITIES PER TYPE AND CLUSTER (MAC)

Denial of Service: 624 (90.7%), 18 (14.6%), 36 (11.5%), 210 (50.2%), 322 (34.4%), 4 (11.4%)
Execute Code: 615 (89.4%), 2 (1.6%), 3 (1%), 377 (90.2%), 447 (47.7%), 1 (2.9%)
Overflow: 574 (83.4%), 14 (4.5%), 418 (100%), 35 (100%), 0, 0
SQL Injection: 3 (0.3%), 0, 0, 0, 0, 0
Obtain Information: 9 (1.3%), 3 (2.4%), 312 (100%), 1 (0.1%), 0, 0
Gain Privileges: 40 (5.8%), 123 (100%), 1 (0.2%), 22 (62.9%), 0, 0
Bypass Restriction or Similar: 5 (4.1%), 42 (13.5%), 208 (22.2%), 3 (8.6%), 0, 0
Directory Traversal: 19 (2%), 0, 0, 0, 0, 0
Cross Site Scripting: 1 (0.3%), 57 (6.1%), 0, 0, 0, 0
Http Response Splitting: 2 (0.2%), 0, 0, 0, 0, 0
CSRF: 3 (0.3%), 0, 0, 0, 0, 0
Memory Corruption: 688 (100%), 5 (4.1%), 1 (0.3%), 63 (6.7%), 1 (2.9%), 0
# Vulnerabilities (clusters 1-6): 688, 123, 312, 418, 937, 35

Table 47: NUMBER OF VULNERABILITIES PER TYPE AND CLUSTER (IOS)

Denial of Service: 456 (100%), 11 (39.3%), 0, 33 (94.3%), 8 (30.8%), 0, 9 (23.1%)
Execute Code: 0, 28 (100%), 6 (46.1%), 0, 0, 3 (6.2%), 34 (87.2%)
Overflow: 0, 28 (100%), 0, 35 (100%), 0, 1 (2.1%), 0
SQL Injection: 0, 0, 0, 0, 0, 0, 0
Obtain Information: 0, 0, 0, 0, 26 (100%), 1 (2.1%), 0
Gain Privileges: 0, 0, 0, 1 (2.9%), 0, 0, 13 (33.3%)
Bypass Restriction or Similar: 2 (0.4%), 0, 0, 0, 0, 47 (97.9%), 0
Directory Traversal: 1 (0.2%), 0, 0, 0, 0, 1 (2.1%), 0
Cross Site Scripting: 0, 0, 11 (84.6%), 0, 0, 0, 0
Http Response Splitting: 0, 0, 0, 0, 0, 0, 0
CSRF: 0, 0, 0, 0, 0, 0, 3 (7.7%)
Memory Corruption: 5 (1%), 0, 0, 2 (5.7%), 1 (3.8%), 0, 0
# Vulnerabilities (clusters 1-7): 456, 28, 13, 35, 26, 48, 39

Table 48: NUMBER OF VULNERABILITIES PER TYPE AND CLUSTER (LINUX)
Denial of Service: 1579 (100%), 1 (1.8%), 722 (96%), 93 (87.7%), 13 (3.9%), 0, 34 (6.2%)
Execute Code: 68 (4.3%), 6 (11.1%), 213 (28.3%), 45 (42.4%), 23 (7%), 608 (52.1%), 6 (1.1%)
Overflow: 0, 0, 752 (100%), 0, 6 (1.8%), 433 (37.1%), 20 (3.7%)
SQL Injection: 2 (0.1%), 0, 0, 0, 2 (0.6%), 9 (0.8%), 0
Obtain Information: 0, 0, 24 (3.2%), 2 (1.9%), 28 (8.5%), 0, 547 (100%)
Gain Privileges: 82 (5.2%), 1 (1.8%), 41 (5.4%), 10 (9.4%), 9 (2.7%), 275 (23.6%), 13 (2.4%)
Bypass Restriction or Similar: 0, 4 (7.4%), 6 (0.8%), 0, 329 (100%), 1 (0.1%), 0
Directory Traversal: 0, 54 (100%), 1 (0.1%), 0, 1 (0.3%), 0, 0
Cross Site Scripting: 0, 0, 0, 0, 3 (0.9%), 81 (6.9%), 2 (0.4%)
Http Response Splitting: 0, 0, 0, 0, 0, 6 (0.5%), 0
CSRF: 0, 0, 0, 0, 4 (1.2%), 9 (0.8%), 0
Memory Corruption: 0, 0, 162 (21.5%), 106 (100%), 2 (0.6%), 6 (0.5%), 2 (0.4%)
# Vulnerabilities (clusters 1-7): 1579, 54, 752, 106, 329, 1166, 547

Table 49: CLUSTER COMPOSITION FOR OSS
(Columns: cluster number, prevalent keywords, cluster name.)

Windows:
  1  Gain privileges                                 G
  2  Obtain Information                              O
  3  DoS, Execute code, Memory corruption            DEM
  4  DoS                                             D
  5  Execute code, Overflow                          EO
  6  Execute code                                    E
Mac:
  1  DoS, Execute code, Overflow, Memory corruption  DEOM
  2  Gain privileges                                 G
  3  Obtain Information                              O
  4  Execute code, Overflow                          EO
  5  Execute code (47.7%)                            E
  6  Overflow, Gain privileges                       OG
IOS:
  1  DoS                                             D
  2  Execute code, Overflow                          EO
  3  Cross site scripting                            C
  4  DoS, Overflow                                   DO
  5  Obtain Information                              O
  6  Bypass a restriction                            B
  7  Execute code                                    E
Linux:
  1  DoS                                             D
  2  Directory Traversal                             DT
  3  DoS, Overflow                                   DO
  4  DoS, Memory corruption                          DM
  5  Bypass a restriction                            B
  6  Execute code (52.1%)                            E
  7  Obtain Information                              O

Web Browsers

Table 50: NUMBER OF VULNERABILITIES PER WEB BROWSER

Web browser:                   Explorer       Safari        Firefox        Chrome
# Vulnerabilities              1862           994           1784           1906
# Labelled vulnerabilities     1601 (86.0%)   886 (89.1%)   1375 (77.1%)   1563 (82.0%)
# Non-labelled vulnerabilities 261 (14.0%)    108 (10.9%)   409 (22.9%)    343 (18.0%)

Table 51: NUMBER OF VULNERABILITIES PER TYPE AND WEB BROWSER

Keywords                       Explorer       Safari        Firefox       Chrome
Denial of Service              757 (47.28%)   649 (73.25%)  606 (44.07%)  993 (63.53%)
Execute Code                   1197 (74.77%)  627 (70.77%)  724 (52.65%)  335 (21.43%)
Overflow                       739 (46.16%)   453 (51.13%)  370 (26.91%)  473 (30.26%)
SQL Injection                  0 (0%)         0 (0%)        0 (0%)        0 (0%)
Obtain Information             145 (9.06%)    97 (10.95%)   163 (11.85%)  102 (6.53%)
Gain Privileges                38 (2.37%)     3 (0.34%)     40 (2.91%)    9 (0.58%)
Bypass Restriction or Similar  116 (7.24%)    64 (7.22%)    188 (13.67%)  170 (10.88%)
Directory Traversal            4 (0.25%)      3 (0.34%)     9 (0.65%)     8 (0.51%)
Cross Site Scripting           48 (3.00%)     70 (7.90%)    125 (9.09%)   63 (4.03%)
Http Response Splitting        1 (0.06%)      0 (0%)        1 (0.07%)     0 (0%)
CSRF                           0 (0%)         2 (0.23%)     9 (0.65%)     3 (0.19%)
Memory Corruption              885 (55.28%)   508 (57.34%)  388 (28.22%)  207 (13.24%)

Table 52: NUMBER OF VULNERABILITIES PER TYPE AND CLUSTER (INTERNET EXPLORER)

Denial of Service: 644 (100%), 9 (2.5%), 104 (40.9%), 0, 0
Execute Code: 641 (99.5%), 2 (1.6%), 200 (89.7%), 354 (100%), 0
Overflow: 502 (77.9%), 1 (0.8%), 223 (100%), 13 (5.1%), 0
SQL Injection: 0, 0, 0, 0, 0
Obtain Information: 26 (20.6%), 3 (0.8%), 116 (45.7%), 0, 0
Gain Privileges: 1 (0.8%), 5 (1.4%), 32 (12.6%), 0, 0
Bypass Restriction or Similar: 1 (0.1%), 102 (80.9%), 13 (3.7%), 0, 0
Directory Traversal: 1 (0.8%), 1 (0.3%), 2 (0.8%), 0, 0
Cross Site Scripting: 41 (32.5%), 3 (0.8%), 4 (1.6%), 0, 0
Http Response Splitting: 1 (0.4%), 0, 0, 0, 0
CSRF: 0, 0, 0, 0, 0
Memory Corruption: 636 (98.8%), 149 (66.8%), 98 (27.7%), 2 (0.8%), 0
# Vulnerabilities (clusters 1-5): 644, 126, 223, 354, 254

Table 53: NUMBER OF VULNERABILITIES PER TYPE AND CLUSTER (SAFARI)

Denial of Service: 143 (79.9%), 506 (99.8%), 0
Web Browsers

Table 50: NUMBER OF VULNERABILITIES PER WEB BROWSER

Web Browser                    | Explorer     | Safari      | Firefox      | Chrome
# Vulnerabilities              | 1862         | 994         | 1784         | 1906
# Labelled Vulnerabilities     | 1601 (86.0%) | 886 (89.1%) | 1375 (77.1%) | 1563 (82.0%)
# Non-labelled Vulnerabilities | 261 (14.0%)  | 108 (10.9%) | 409 (22.9%)  | 343 (18.0%)

Table 51: NUMBER OF VULNERABILITIES PER TYPE AND WEB BROWSER

Keywords                      | Explorer      | Safari       | Firefox      | Chrome
Denial of Service             | 757 (47.28%)  | 649 (73.25%) | 606 (44.07%) | 993 (63.53%)
Execute Code                  | 1197 (74.77%) | 627 (70.77%) | 724 (52.65%) | 335 (21.43%)
Overflow                      | 739 (46.16%)  | 453 (51.13%) | 370 (26.91%) | 473 (30.26%)
SQL Injection                 | 0 (0%)        | 0 (0%)       | 0 (0%)       | 0 (0%)
Obtain Information            | 145 (9.06%)   | 97 (10.95%)  | 163 (11.85%) | 102 (6.53%)
Gain Privileges               | 38 (2.37%)    | 3 (0.34%)    | 40 (2.91%)   | 9 (0.58%)
Bypass Restriction or Similar | 116 (7.24%)   | 64 (7.22%)   | 188 (13.67%) | 170 (10.88%)
Directory Traversal           | 4 (0.25%)     | 3 (0.34%)    | 9 (0.65%)    | 8 (0.51%)
Cross Site Scripting          | 48 (3.00%)    | 70 (7.90%)   | 125 (9.09%)  | 63 (4.03%)
Http Response Splitting       | 1 (0.06%)     | 0 (0%)       | 1 (0.07%)    | 0 (0%)
CSRF                          | 0 (0%)        | 2 (0.23%)    | 9 (0.65%)    | 3 (0.19%)
Memory Corruption             | 885 (55.28%)  | 508 (57.34%) | 388 (28.22%) | 207 (13.24%)

Table 52: NUMBER OF VULNERABILITIES PER TYPE, CLUSTER (INTERNET EXPLORER)

Keywords                      | 1           | 2           | 3           | 4          | 5
Denial of Service             | 644 (100%)  | 0           | 0           | 9 (2.5%)   | 104 (40.9%)
Execute Code                  | 641 (99.5%) | 2 (1.6%)    | 200 (89.7%) | 354 (100%) | 0
Overflow                      | 502 (77.9%) | 1 (0.8%)    | 223 (100%)  | 0          | 13 (5.1%)
SQL Injection                 | 0           | 0           | 0           | 0          | 0
Obtain Information            | 0           | 26 (20.6%)  | 0           | 3 (0.8%)   | 116 (45.7%)
Gain Privileges               | 0           | 1 (0.8%)    | 0           | 5 (1.4%)   | 32 (12.6%)
Bypass Restriction or Similar | 1 (0.1%)    | 102 (80.9%) | 0           | 13 (3.7%)  | 0
Directory Traversal           | 0           | 1 (0.8%)    | 0           | 1 (0.3%)   | 2 (0.8%)
Cross Site Scripting          | 0           | 41 (32.5%)  | 0           | 3 (0.8%)   | 4 (1.6%)
Http Response Splitting       | 0           | 0           | 0           | 0          | 1 (0.4%)
CSRF                          | 0           | 0           | 0           | 0          | 0
Memory Corruption             | 636 (98.8%) | 0           | 149 (66.8%) | 98 (27.7%) | 2 (0.8%)
# Vulnerabilities             | 644         | 126         | 223         | 354        | 254

Table 53: NUMBER OF VULNERABILITIES PER TYPE, CLUSTER (SAFARI)

Keywords                      | 1          | 2           | 3
Denial of Service             | 0          | 143 (79.9%) | 506 (99.8%)
Execute Code                  | 2 (1%)     | 123 (68.7%) | 502 (99%)
Overflow                      | 1 (0.5%)   | 34 (19%)    | 418 (82.4%)
SQL Injection                 | 0          | 0           | 0
Obtain Information            | 94 (47%)   | 1 (0.6%)    | 2 (0.4%)
Gain Privileges               | 3 (1.5%)   | 0           | 0
Bypass Restriction or Similar | 63 (31.5%) | 1 (0.6%)    | 0
Directory Traversal           | 2 (1%)     | 1 (0.6%)    | 0
Cross Site Scripting          | 69 (34.5%) | 1 (0.6%)    | 0
Http Response Splitting       | 0          | 0           | 0
CSRF                          | 2 (1%)     | 0           | 0
Memory Corruption             | 0          | 1 (0.6%)    | 507 (100%)
# Vulnerabilities             | 200        | 179         | 507

Table 54: NUMBER OF VULNERABILITIES PER TYPE, CLUSTER (FIREFOX)

Keywords                      | 1           | 2           | 3           | 4           | 5
Denial of Service             | 100 (46.1%) | 389 (99.2%) | 0           | 113 (76.3%) | 4 (2.6%)
Execute Code                  | 138 (63.6%) | 280 (71.4%) | 208 (44.5%) | 95 (64.2%)  | 3 (2%)
Overflow                      | 217 (100%)  | 0           | 0           | 148 (100%)  | 5 (3.3%)
SQL Injection                 | 0           | 0           | 0           | 0           | 0
Obtain Information            | 10 (4.6%)   | 2 (0.5%)    | 0           | 0           | 151 (100%)
Gain Privileges               | 4 (1.8%)    | 1 (0.2%)    | 31 (6.6%)   | 1 (0.7%)    | 3 (2%)
Bypass Restriction or Similar | 1 (0.5%)    | 1 (0.2%)    | 152 (32.5%) | 2 (1.3%)    | 32 (21.2%)
Directory Traversal           | 0           | 1 (0.2%)    | 7 (1.5%)    | 1 (0.7%)    | 0
Cross Site Scripting          | 0           | 0           | 122 (26.1%) | 0           | 3 (2%)
Http Response Splitting       | 0           | 0           | 1 (0.2%)    | 0           | 0
CSRF                          | 0           | 0           | 8 (1.7%)    | 1 (0.7%)    | 0
Memory Corruption             | 0           | 238 (60.7%) | 2 (0.4%)    | 148 (100%)  | 0
# Vulnerabilities             | 217         | 392         | 467         | 148         | 151

Table 55: NUMBER OF VULNERABILITIES PER TYPE, CLUSTER (CHROME)

Keywords                      | 1           | 2          | 3           | 4         | 5
Denial of Service             | 868 (100%)  | 1 (1%)     | 2 (0.9%)    | 1 (1.8%)  | 121 (36.8%)
Execute Code                  | 0           | 2 (2%)     | 2 (0.9%)    | 5 (9.1%)  | 326 (99%)
Overflow                      | 263 (30.3%) | 3 (3%)     | 38 (17.9%)  | 0         | 169 (51.4%)
SQL Injection                 | 0           | 0          | 0           | 0         | 0
Obtain Information            | 0           | 99 (100%)  | 0           | 0         | 3 (0.9%)
Gain Privileges               | 0           | 0          | 9 (4.2%)    | 0         | 0
Bypass Restriction or Similar | 1 (0.1%)    | 11 (11.1%) | 154 (72.6%) | 4 (7.3%)  | 0
Directory Traversal           | 0           | 0          | 8 (3.8%)    | 0         | 0
Cross Site Scripting          | 0           | 8 (8.1%)   | 0           | 55 (100%) | 0
Http Response Splitting       | 0           | 0          | 0           | 0         | 0
CSRF                          | 0           | 0          | 3 (1.4%)    | 0         | 0
Memory Corruption             | 70 (8.1%)   | 1 (1%)     | 1 (0.5%)    | 0         | 135 (41%)
# Vulnerabilities             | 868         | 99         | 212         | 55        | 329

Table 56: CLUSTER COMPOSITION FOR WEB BROWSERS

Web Browser       | Cluster Number | Prevalent Keywords                             | Cluster Name
Internet Explorer | 1              | DoS, Execute code, Overflow, Memory corruption | DEOM
Internet Explorer | 2              | Bypass a restriction                           | B
Internet Explorer | 3              | Execute code, Overflow, Memory corruption      | EOM
Internet Explorer | 4              | Execute code                                   | E
Internet Explorer | 5              | Obtain Information (45.7%)                     | O
Safari            | 1              | Obtain Information (47%)                       | O
Safari            | 2              | DoS, Execute code                              | DE
Safari            | 3              | DoS, Execute code, Overflow, Memory corruption | DEOM
Firefox           | 1              | Execute code, Overflow                         | EO
Firefox           | 2              | DoS, Execute code, Memory corruption           | DEM
Firefox           | 3              | Execute code (44.5%)                           | E
Firefox           | 4              | DoS, Execute code, Overflow, Memory corruption | DEOM
Firefox           | 5              | Obtain Information                             | O
Chrome            | 1              | DoS                                            | D
Chrome            | 2              | Obtain Information                             | O
Chrome            | 3              | Bypass a restriction                           | B
Chrome            | 4              | Cross Site Scripting                           | C
Chrome            | 5              | Execute code                                   | E