ABSTRACT

Title of dissertation: IMPROVING THE USABILITY OF STATIC ANALYSIS TOOLS USING MACHINE LEARNING

Ugur Koc, Doctor of Philosophy, 2019

Dissertation directed by: Professor Dr. Adam A. Porter, Department of Computer Science
                          Professor Dr. Jeffrey S. Foster, Department of Computer Science

Static analysis can be useful for developers to detect critical security flaws and bugs in software. However, due to challenges such as scalability and undecidability, static analysis tools often have performance and precision issues that reduce their usability and thus limit their wide adoption. In this dissertation, we present machine learning-based approaches to improve the adoption of static analysis tools by addressing two usability challenges: false positive error reports and proper tool configuration.

First, false positives are one of the main reasons developers give for not using static analysis tools. To address this issue, we developed a novel machine learning approach for learning directly from program code to classify the analysis results as true or false positives. The approach has two steps: (1) data preparation that transforms source code into certain input formats for processing by sophisticated machine learning techniques; and (2) using those techniques to discover code structures that cause false positive error reports and to learn false positive classification models. To evaluate the effectiveness and efficiency of this approach, we conducted a systematic, comparative empirical study of four families of machine learning algorithms for classifying false positives: hand-engineered features, bag of words, recurrent neural networks, and graph neural networks. In this study, we considered two application scenarios using multiple ground-truth program sets. Overall, the results suggest that recurrent neural networks outperformed the other algorithms, although interesting tradeoffs are present among all techniques. Our observations also provide insight into the future research needed to speed the adoption of machine learning approaches in practice.

Second, many static program verification tools come with configuration options that present tradeoffs between performance, precision, and soundness to allow users to customize the tools for their needs. However, understanding the impact of these options and correctly tuning the configurations is a challenging task, requiring domain expertise and extensive experimentation. To address this issue, we developed an automatic approach, auto-tune, to configure verification tools for given target programs. The key idea of auto-tune is to leverage a meta-heuristic search algorithm to probabilistically scan the configuration space, using machine learning models both as a fitness function and as an incorrect result filter. auto-tune is tool- and language-agnostic, making it applicable to any off-the-shelf configurable verification tool. To evaluate the effectiveness and efficiency of auto-tune, we applied it to four popular program verification tools for C and Java and conducted experiments under two use-case scenarios. Overall, the results suggest that running verification tools using auto-tune produces results that are comparable to configurations manually tuned by experts, and in some cases improves upon them with reasonable precision.
IMPROVING THE USABILITY OF STATIC ANALYSIS TOOLS USING MACHINE LEARNING

by

Ugur Koc

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2019

Advisory Committee:
Professor Dr. Adam A. Porter, Co-chair
Professor Dr. Jeffrey S. Foster, Co-chair
Professor Dr. Jeffrey W. Herrmann, Dean's Representative
Professor Dr. Marine Carpuat
Professor Dr. Mayur Naik

© Copyright by Ugur Koc 2019

Acknowledgments

I owe my gratitude to all the people who have made this dissertation possible. Because of them, my graduate experience has been one I will cherish forever.

First, I'd like to thank my academic advisors, Professor Dr. Adam A. Porter and Professor Dr. Jeffrey S. Foster, for giving me a unique opportunity to work on inspiring and challenging research projects over the five and a half years of my doctoral study. They have always made themselves available for help and advice. Special thanks to Dr. Shiyi Wei for his invaluable advice and outstanding contributions to the research projects we worked on together that became the main parts of this thesis. Many thanks to Dr. ThanhVu H. Nguyen for the phenomenal tutoring he provided during the first two years of my study. Special thanks to my academic advisor during my master's study, Professor Dr. Cemal Yilmaz, for helping me set the foundations of my academic life.

I'd also like to thank my other research collaborators Parsa Saadatpanat, Austin Mordahl, Dr. Marine Carpuat, Dr. James Hill, Dr. Rajeev R. Raje, Zachary P. Reynolds, and Abhinandan B. Jayanth for the expertise they brought into our projects and all of the contributions that made this dissertation possible.

I want to acknowledge the financial support from the Science and Research Council of Turkey (TUBITAK), the National Science Foundation of the United States, and the Department of Homeland Security for all the projects discussed herein.

Nobody has been more supportive of me in the pursuit of this study than the members of my family. I want to thank my parents, whose love and guidance are always with me in whatever I pursue. In addition, I'd like to express my gratitude to my partner Veronika for the support and care she provided for me. She was always ready to listen to me and motivate me to push further. Last but not least, I would like to thank my friends Leonard, Cabir, Juan, and Dr. Musa, for always being positive and supportive. Thank you for believing in me.

Table of Contents

Acknowledgements
Table of Contents
List of Tables
List of Figures
List of Abbreviations

1 Introduction
  1.1 Addressing False Positives of Static Bug Finders
  1.2 Configurability of Program Verification Tools

2 Background Information
  2.1 Program Analysis
  2.2 Supervised Classification for False Positive Detection
      2.2.1 Hand-engineered Features
      2.2.2 Bag of Words
      2.2.3 Recurrent Neural Networks
      2.2.4 Graph Neural Networks

3 Learning a Classifier for False Positive Error Reports Emitted by Static Analysis Tools
  3.1 Code Preprocessing
  3.2 Learning
  3.3 Case Study
      3.3.1 Subject Static Analysis Tool and Warning Type
      3.3.2 Data
      3.3.3 Preprocessing
  3.4 Results and Analysis
      3.4.1 Naive Bayes Classifier Analysis
      3.4.2 LSTM Classifier Analysis
      3.4.3 Threats to Validity
  3.5 Attributions and Acknowledgments

4 An Empirical Assessment of Machine Learning Approaches for Triaging Reports of a Java Static Analysis Tool
  4.1 Adapting Machine Learning Techniques to Classify False Positives
      4.1.1 Hand-engineered Features
      4.1.2 Program Slicing for Summarization
      4.1.3 Bag of Words
      4.1.4 Recurrent Neural Networks
            4.1.4.1 Data Cleansing and Tokenization (Tcln)
            4.1.4.2 Abstracting Numbers and String Literals (Tans)
            4.1.4.3 Abstracting Program-specific Words (Taps)
            4.1.4.4 Extracting English Words From Identifiers (Text)
      4.1.5 Graph Neural Networks
            4.1.5.1 Using Kind, Operation, and Type Fields (KOT)
            4.1.5.2 Extracting a Single Item in Addition to KOT (KOTI)
            4.1.5.3 Node Encoding Using Embeddings (Enc)
  4.2 Tool and Benchmarks
  4.3 Experimental Setup
  4.4 Analysis of Results
      4.4.1 RQ1: Overall Performance Comparison
      4.4.2 RQ2: Effect of Data Preparation
      4.4.3 RQ3: Variability Analysis
      4.4.4 RQ4: Further Interpreting the Results
      4.4.5 Threats To Validity
  4.5 Attributions and Acknowledgments

5 Auto-tuning Configurable Program Verification Tools
  5.1 Motivating Examples
  5.2 Our Auto-tuning Approach
      5.2.1 Meta-heuristic Configuration Search
      5.2.2 Learning the Fitness Function and Filter
      5.2.3 Neighboring Configuration Generation
  5.3 Implementation
  5.4 Datasets
      5.4.1 Subject Tools
      5.4.2 Benchmark Programs
      5.4.3 Ground-truth Datasets
  5.5 Evaluation
      5.5.1 Experimental Setup
      5.5.2 RQ1: How close can auto-tune get to the expert knowledge?
      5.5.3 RQ2: Can auto-tune improve on top of expert knowledge?
      5.5.4 RQ3: How do different neighbor generation strategies affect auto-tune's performance?
      5.5.5 RQ4: How do different machine learning techniques (i.e., classification and regression) affect auto-tune's performance?
      5.5.6 Threats to Validity
  5.6 Attributions and Acknowledgments

6 Related Work
  6.1 Related Work for Automatic False Positive Classification
  6.2 Natural Language Processing (NLP) Techniques Applied to Code
  6.3 Selection and Ranking of Static Analyses
  6.4 Adaptive Sensitivity Static Analysis

7 Conclusion

8 Future Work
  8.1 Future Work for False Positive Detection
  8.2 Future Work for auto-tune

A Program Analysis Techniques
  A.1 Taint Analysis
  A.2 Model-checking
  A.3 Symbolic Execution

Bibliography

List of Tables

3.1 Performance results for Naive Bayes and LSTM.
3.2 Important instructions for classification.
3.3 Precision improvements that can be achieved by using the LSTM models as a false positive filter.
4.1 Programs in the real-world benchmark.
4.2 BoW, LSTM, and GGNN approaches.
4.3 Recall, precision, and accuracy results for the approaches in Table 4.2 and the four most accurate algorithms for HEF, sorted by accuracy. The numbers in normal font are the median of 25 runs, and the numbers in smaller font are the semi-interquartile range (SIQR). The dashed lines separate the approaches that have high accuracy from the others at a point where there is a relatively large gap.
4.4 Number of epochs and training times for the LSTM and GGNN approaches. Median and SIQR values as in Table 4.3.
4.5 Dataset stats for the LSTM approaches. For the sample length, numbers in the normal font are the maximum and in the smaller font are the mean.
5.1 Subject verification tools.
5.2 Data distribution in the ground-truth datasets (aggregated).
5.3 The number of correct and incorrect results and the computed precision for two auto-tune settings, named S1 and S2, with the base neighbor generation strategy and classification model (threshold 0.1 for S1 and 0.4 for S2).
5.4 The number of configurations generated, accepted, improved the best so far, and used for running the tool.
5.5 Training performance.

List of Figures

2.1 Structure of a standard recurrent neural network (RNN).
2.2 Structure of a standard LSTM.
3.1 Learning approach overview.
3.2 The LSTM model unrolled over time.
3.3 An example Owasp program that FindSecBugs generates a false positive error report for (simplified for presentation).
3.4 LSTM color map for two correctly classified backward slices.
4.1 Sample PDG node created with the Joana program analysis framework (simplified for presentation).
4.2 Venn diagrams of the number of correctly classified examples for the HEF-J48, BoW-Freq, LSTM-Ext, and GGNN-KOT approaches, averaged over the 5 trained models.
4.3 An example program (simplified) from the OWASP benchmark that was correctly classified only by LSTM-Ext (A) and the sequential representation used for LSTM-Ext (B).
5.1 Code examples from SV-COMP 2018.
5.2 Workflow of our auto-tuning approach.
5.3 Distribution of analysis results for each sample configuration.
5.4 auto-tune improvements with classification models as the number of conclusive analysis tasks for varying threshold values. The search runs only if comp-default cannot complete. Each stacked bar shows the distribution of results for each neighbor generation strategy. The number on top of each bar is the difference between auto-tune's score and the comp-default configuration score.
5.5 auto-tune improvements with regression models.
A.1 An example code vulnerable to SQL injection.

List of Abbreviations

API      Application Programming Interface
SA       Static Analysis
PA       Program Analysis
SV       Software Verification
PV       Program Verification
SV-COMP  Software Verification Competition
BMC      Bounded Model Checking
ML       Machine Learning
NLP      Natural Language Processing
RNN      Recurrent Neural Networks
LSTM     Long Short-Term Memories
GNN      Graph Neural Networks
GGNN     Gated Graph Neural Networks
AST      Abstract Syntax Tree
CFG      Control Flow Graph
PDG      Program Dependency Graph
SDG      System Dependency Graph
CA       Covering Array
SQL      Structured Query Language

Chapter 1: Introduction

Static analysis (SA) is the process of analyzing a software program's code to find facts about the security and quality of the program without executing it. There are many static analysis tools (e.g., security checkers, bug detectors, and program verifiers) that automatically perform this process to identify and report weaknesses and flaws in a software program that might jeopardize its integrity. In this respect, static analysis tools can aid developers in detecting and fixing problems in their software early in the development process, when it is usually the cheapest to do so. However, several usability issues affect their performance and precision and thus limit their wide adoption in software development practice.

First, they are known to generate large numbers of spurious reports, i.e., false positives. Simplifying greatly, this happens because the tools rely on approximations and assumptions that help their analyses scale to large and complex software systems. The tradeoff is that while analyses can become faster, they also become more imprecise, leading to more and more false positives. As a result, developers often find themselves sifting through many false positives to find and solve a small number of real flaws. Inevitably, developers often stop inspecting the tool's output altogether, and the real bugs found by the tool go undetected [1].
Second, many static program verification tools come with analysis options that allow their users to customize their operation and control the simplifications applied to the task to be completed. These options often present tradeoffs between performance, precision, and soundness. Understanding these tradeoffs is, however, very often a challenging task, requiring domain expertise and extensive experiments. In practice, users, especially non-experts, often run verifiers on the target program with a provided "default" configuration to see if it produces desirable outputs. If it does not, the user often does not know how to modify the analysis options to produce better results.

We believe the challenges above have prevented many program verification tools from being used to their full potential. As software programs spread to every area of our lives and take over many critical jobs like performing surgery and driving cars, solving these challenges becomes more and more essential to assure correctness, quality, and performance at lower costs.

1.1 Addressing False Positives of Static Bug Finders

There have been decades of research efforts attempting to improve the precision of static bug finders, i.e., reducing the spurious results. One line of research aims at developing better program analysis algorithms and techniques [2, 3, 4, 5]. Although, in theory, these techniques are smarter and more precise, in practice, their realistic implementations have over-approximations in modeling the most common language features [6]. Consequently, false positives persist. Another line of research takes a data-driven approach, relying on machine learning techniques to classify and remove false positive error reports after the analysis completes. At a high level, these works extract a set of features from the analysis reports and the programs being analyzed (e.g., the kind of problem being reported, the lines of code in the program, the location of the warning, etc.) and train classification (or ranking) models with labeled datasets (i.e., supervised classification) to identify and filter false positive analysis reports [7, 8, 9, 10].

Although these supervised classification techniques have proven to be an excellent complement to algorithmic static analysis, as they take a data-centric approach and learn from past mistakes, manually extracted feature-based approaches have some limitations. First, manual feature extraction can be costly, as it requires domain expertise to select the relevant features for a given language, analysis algorithm, and problem. Second, the set of features used for learning classification models for one specific setting (i.e., programming language, analysis problem, and algorithm) is not necessarily useful for other settings; i.e., such features are not generalizable. Third, such features are often inadequate for capturing the root causes of false positives. When dealing with an analysis report, developers review their code with data and control dependencies in focus. Such dependency insights are not likely to be covered by a fixed set of features.

We hypothesize that adding detailed knowledge of a program's source code and structure to the classification process can lead to more effective classifiers. Therefore, we developed a novel learning approach for learning a classifier from the codebases of the analyzed programs [11] (Chapter 3).
The first step is data preparation that attempts to remove extraneous details to reduce the code to a smaller form of itself that contains only the relevant parts for the analysis finding (i.e., error report). Then, using the reduced code, the second step is to learn a false positive classifier. To evaluate this approach, we conducted a case study of a highly used Java security bug finder (FindSecBugs [12]) using the OWASP web vulnerabilities bench- mark suite [13] as the dataset for learning. In particular, we experimented with two code reduction techniques, which we called method body and backward slice (see Section 3.1) and two machine learning algorithms: Naive Bayesian inference and long short-term memories (LSTM) [14, 15, 16, 17]. Our experimental results were positive. In the best case with the LSTM models, the proposed approach correctly detected 81% of false positives while misclassifying only 2.7% of real problems (Chap- ter 3). In other words, we could significantly improve the precision of the subject tool from 49.6% to 90.5% by using this classification model as a post-analysis filter. Next, we extended the false positive classification approach with more precise data preparation techniques. We also conducted a systematic empirical assessment of four different machine learning techniques for supervised false positive classifi- cation; hand-engineered features (state-of-the-art), bag of words, recurrent neural networks, and graph neural networks [18] (Chapter 4). Our initial hypothesis is that data preparation will have a significant effect on learning and the gener- alizability of learned classifiers. We designed and developed three sets of code transformations. The first set of transformations extract the subset of program?s codebase that is relevant for a given analysis report. These transformations have 4 a significant impact on the performance of the approach as they reduce the code dramatically. The second set of transformations project the reduced code onto a generic space free of program-specific words via abstraction and canonicalization, so not to memorize the program-specific words in training and avoid overfitting. These transformations are essential for the generalizability of the learned classifiers. The last set of transformations tokenize the code. These transformations will also impact the performance as they will determine the vocabulary to learn. In our experiments, we used multiple ground-truth program analysis datasets with varying levels of data preparation under two application scenarios. The first scenario is when the classification models are learned from and used for the same programs, while the second scenario is when the classification models are learned from some programs, but they are later used for different programs (i.e., training and test sets are consist of different sets of non-overlapping programs). To the best of our knowledge, the first scenario is the widely ?and the only? studied one in the literature. Other than the OWASP benchmark used in the case study presented in Chap- ter 3, we created two more datasets from a program analysis benchmark of real-world programs that we also created to use in this empirical assessment. These real-world datasets enable us to address critical research questions about the performance and generalizability of the approach. Moreover, the varying level of data preparations helps us to test our initial hypothesis about the effect of data preparation for the different application scenarios considered. 
Overall, our results suggest that recurrent neural networks (which learn over a program's source code) outperformed the other learning techniques, although interesting tradeoffs are present among all techniques, and that more precise data preparation improves the generalizability of the learned classifiers. Our results also suggest that the second application scenario presents interesting challenges for the research field. Our observations provide insight into the future research needed to speed the adoption of machine learning approaches in practice (Chapter 4).

1.2 Configurability of Program Verification Tools

Recent studies have shown that configuration options indeed present tradeoffs [19], especially when different program features are present [20, 21, 22]. Researchers have proposed various techniques that selectively apply a configuration option to certain programs or parts of a program (i.e., adaptive analysis), using heuristics defined manually or learned with machine learning techniques [20, 23, 21, 24, 22, 25]. Although a promising research direction, these techniques are currently focused on tuning limited kinds of analysis options (e.g., context-sensitivity). In addition, supervised machine learning techniques have recently been used to improve the usability of static analysis tools. The applications include classifying, ranking, or prioritizing analysis results [9, 10, 26, 27, 7, 28, 29], and ranking program verification tools based on their likelihood of completing a given task [30, 31]. However, the configurability of program verification tools has not been considered in these applications. We believe that focusing on automatically selecting configurations will make verification tools more usable and allow them to better fulfill their potential.

Therefore, we designed and developed a meta-reasoning approach, auto-tune, to automatically configure program verification tools for given target programs (Chapter 5). We aim to develop a generalizable approach that can be applied to various tools that are implemented in and targeted at different programming languages. We also aim to develop an efficient approach that can effectively search for a desirable configuration in large configuration spaces. Our approach leverages two main ideas to achieve these goals. First, we use prediction models both as fitness functions and as incorrect result filters. Our prediction models are trained with language-independent features of the target programs and the configuration options of the subject verification tools. Second, we use a meta-heuristic search algorithm that searches the configuration spaces of verification tools using the models mentioned above.

Overall, auto-tune works as follows: we first train two prediction models for use in the meta-heuristic search algorithm. We use a ground-truth program analysis dataset that consists of correct, incorrect, and inconclusive analysis runs (an inconclusive run means the tool fails to come to a judgment due to a timeout, crash, or a similar reason). The first model, the fitness function, is trained on the entire dataset; the second model, the incorrect result filter (or, for short, filter), is trained on the conclusive part of the dataset, i.e., excluding the inconclusive analysis runs. Our search algorithm starts with a default configuration of the tool if available; otherwise, it starts with a random configuration. The algorithm then systematically, but non-deterministically, alters this configuration to generate a new configuration. Throughout the search, the fitness function and filter are used to decide whether a configuration is a good candidate to run the tool with. The algorithm continues to scan the search space by generating new configurations until it locates one that both meets the thresholds of the fitness function and filter and leads to a conclusive analysis result when run.
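To make this workflow concrete, the following sketch shows one way such a search loop could be organized. It is a minimal illustration under assumed interfaces: the callback names, the predict_proba methods, the thresholds, and the stopping policy are hypothetical stand-ins, not the actual auto-tune implementation, whose real search algorithm and acceptance criteria are described in Chapter 5.

import random

def auto_tune(run_tool, make_features, neighbor, fitness, incorrect_filter,
              start_config, fitness_threshold=0.5, filter_threshold=0.5,
              max_iters=1000, rng=random.Random(0)):
    """Meta-heuristic configuration search guided by two learned models.

    run_tool(config)      -> verdict string, or None if the run is inconclusive
    make_features(config) -> feature vector of program features + config options
    neighbor(config, rng) -> a probabilistically altered copy of config
    fitness, incorrect_filter: objects with predict_proba(features) -> probability
    """
    config = start_config
    for _ in range(max_iters):
        x = make_features(config)
        # Only run the (expensive) verifier when both models are optimistic:
        # the fitness model predicts a useful run, and the filter predicts the
        # result is unlikely to be incorrect.
        if (fitness.predict_proba(x) >= fitness_threshold and
                incorrect_filter.predict_proba(x) >= filter_threshold):
            verdict = run_tool(config)
            if verdict is not None:          # conclusive result: stop searching
                return config, verdict
        config = neighbor(config, rng)       # otherwise keep scanning the space
    return None, None                        # budget exhausted without a conclusive run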
We consider auto-tune a meta-reasoning approach [32, 33] because it aims to reason about how verification tools should reason about a given verification task. In this setting, the reasoning of a given verification tool is controlled by configuration options that enable or disable certain simplifications or assumptions throughout the analysis tasks. The ultimate goal of meta-reasoning is to identify a reasoning strategy, i.e., a configuration, that is likely to lead to the desired verification result.

We applied auto-tune to four popular software verification tools. CBMC and Symbiotic [34, 35] verify C/C++ programs, while JBMC [36] and JayHorn [37] verify Java programs. We generated program analysis datasets with ground truths from SV-COMP (https://sv-comp.sosy-lab.org/2019), an annual competition of software verification that includes a large set of both Java and C programs. We used these datasets, which contain between 55K and 300K data points, to train prediction models (i.e., fitness functions and false result filters) for each tool.

To evaluate the effectiveness of auto-tune, we considered two use cases. First, to simulate the scenario in which a non-expert uses a tool without a reliable default configuration, we start auto-tune with a random configuration. Our experiments suggest that auto-tune produces results comparable to configurations manually and painstakingly selected by program analysis experts, i.e., comp-default (the configurations the tools used when participating in the competition). Second, to simulate the scenario in which a tool comes with a reliable default configuration, or is used by an expert, we start auto-tune with the comp-default configuration. Our results suggest that, with regard to the competition's scoring system, auto-tune improves the SV-COMP performance for three of the four program verification tools we studied: Symbiotic, JayHorn, and JBMC. For CBMC, auto-tune also significantly increases the number of correct analysis runs; however, it did not improve the competition score due to the substantial penalty for the few incorrect results it generated (Chapter 5).

The remainder of this document is organized as follows: Chapter 2 provides background information; Chapter 3 presents the learning approach for classifying false positive analysis results with a case study; Chapter 4 presents an empirical assessment of different machine learning techniques for classifying false positive analysis results; Chapter 5 presents the auto-tune approach with empirical evaluations; Chapter 6 surveys the related work; Chapter 7 summarizes the conclusions; and Chapter 8 discusses potential directions for future work.

Chapter 2: Background Information

In this chapter, we provide high-level background information on static analysis and the machine learning algorithms we studied.

2.1 Program Analysis

Program analysis techniques aim at analyzing software programs to find useful facts about the programs' properties, like correctness and safety.
These techniques can be divided into two categories: dynamic analysis and static analysis. Dynamic analysis techniques execute programs to discover these facts, while static analysis techniques use sophisticated algorithms and formal methods to analyze the code of the programs without executing them.

Dynamic analyses, software testing techniques specifically, are widely adopted in software development; i.e., software development companies and individual software developers routinely perform certain kinds of testing to check properties of their programs that are vital for security and integrity [38, 39, 40]. The situation, however, is not the same for static analysis. Many software companies and developers have abandoned the use of static analysis tools [41, 1, 42]. The usability concerns we highlighted in our introduction, i.e., false positives and properly configuring the tools, are two of the leading causes of this situation. Although the prior research addressing these usability concerns has had some success, the proposed approaches themselves had limitations regarding generalizability and usability. Thus, they have not created a significant impact on the adoption of static analysis in industry. The main motivation behind the approaches we present in this dissertation is to change the situation in favor of static analysis tools by solving the mentioned usability issues in generalizable ways. Therefore, we used sophisticated machine learning techniques to automate key steps in our approaches.

2.2 Supervised Classification for False Positive Detection

In this section, we provide brief background information on the four machine learning approaches we studied for supervised false positive classification: hand-engineered features (HEF), bag of words (BoW), recurrent neural networks (RNN), and graph neural networks (GNN).

HEF approaches can be regarded as the state-of-the-art classification application for the false positive detection problem [43]. However, by design, they cannot include the deep structure of the source code being analyzed. The other three approaches add an increasing amount of structural information as we move from BoW, to RNN, and to GNN. To the best of our knowledge, BoW, RNN, and GNN have not been used to solve this problem before.

In this dissertation, we frame false positive detection as a standard binary classification problem [44]. Given an input vector x, e.g., a point in a high-dimensional space R^D, the classifier produces an output y = f_θ(x), where y = 1 for false positives and y = 0 otherwise. Constructing such a classifier requires defining an input vector x that captures features of programs that might help in detecting false positives. We also need to select a function f_θ, as different families of functions encode different inductive biases and assumptions about how to predict outputs for new inputs. Once these two decisions have been made, we can train the classifier by estimating its parameters θ from a large set of known false positives and true positives, i.e., {(x_1, y_1), ..., (x_N, y_N)}.

2.2.1 Hand-engineered Features

A feature vector x can be constructed by asking experts to identify measurable properties of the program and analysis report that might be indicative of true positives or false positives. Each property can then be represented numerically by one or more elements in x. Hand-engineered features have been defined to classify false positive static analysis reports in existing work [45, 46, 9, 10]. Tripp et al. [10] identified lexical, quantitative, and security-specific features to filter false cross-site scripting (XSS) vulnerability reports for JavaScript programs. Note that identifying these features requires expertise in web application security and the JavaScript programming language. Later, in Chapter 4.1.1, we describe our adaptation of this work.

2.2.2 Bag of Words

How can we represent a program as a feature vector that contains useful information to detect false positives? We take inspiration from text classification problems, where classifier inputs are natural language documents, and "Bag of Words" (BoW) features provide simple yet effective representations [47]. BoW represents a document as a multiset of the words found in the document, ignoring their order. The resulting feature vector x for a document has as many entries as words in the dictionary, and each entry indicates whether a specific word exists in the document.

BoW has been used in software engineering research as an information retrieval technique to solve problems such as duplicate report detection [48], bug localization [49], and code search [50]. Such applications often use natural language descriptions provided by humans (developers or users). To our knowledge, BoW has not been used to classify analysis reports.
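To make this representation concrete, the short sketch below builds binary bag-of-words vectors from program token sequences. The toy tokens and the presence/absence encoding are assumptions made for this illustration (Chapter 4 also evaluates a frequency-based BoW variant); this is not the exact featurization code used in our experiments.

def build_vocabulary(token_sequences):
    """Map every distinct token in the corpus to a vector index."""
    vocab = {}
    for tokens in token_sequences:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def bow_vector(tokens, vocab):
    """One entry per vocabulary word; 1 if the word occurs in this program, else 0."""
    vec = [0] * len(vocab)
    for tok in tokens:
        if tok in vocab:
            vec[vocab[tok]] = 1
    return vec

# Usage on two toy reduced-code token sequences; token order is deliberately ignored.
programs = [["invoke", "java.util.HashMap", "aload", "invoke"],
            ["aload", "invoke", "esapi.Encoder"]]
vocab = build_vocabulary(programs)
X = [bow_vector(p, vocab) for p in programs]   # feature vectors for a classifier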
2.2.3 Recurrent Neural Networks

BoW features ignore the order. For text classification, recurrent neural networks [51, 52] have emerged as a powerful alternative approach that views the text as an (arbitrary-length) sequence of words and automatically learns vector representations for each word in the sequence [47].

RNNs process a sequence of words of arbitrary length, X = ⟨x_0, x_1, ..., x_t, ..., x_n⟩, from left to right, one position at a time. In contrast to standard feedforward neural networks, RNNs have looping connections to themselves. For each position t, RNNs compute a feature vector h_t as a function of the observed input x_t and the representation learned for the previous position h_{t-1}, i.e., h_t = RNN(x_t, h_{t-1}). Once the sequence has been read, the output vectors ⟨h_0, h_1, ..., h_n⟩ can be combined in certain ways (e.g., mean pooling) to create a vector representation to be used for supervised classification with a standard machine learning algorithm.

During training, the parameters of the classifier and the parameters of the RNN function are estimated jointly. As a result, the vectors h_t can be viewed as feature representations for x_t that are learned from data, implicitly capturing relevant context knowledge about the sequence prefix ⟨x_0, ..., x_{t-1}⟩ due to the structure of the RNN. Unlike HEF or BoW, the feature vectors h_t are directly optimized for the classification task. This advantage comes at the cost of reduced interpretability, since the values of h_t are much harder for humans to interpret than BoW or HEF features.

Figure 2.1: Structure of a standard recurrent neural network (RNN)

Figure 2.1 shows a standard RNN unit. The blue circle is the input token x_t, the yellow box is the tanh activation function, and the purple circle is the output value h_t for x_t. However, researchers have noted that this RNN structure suffers from the vanishing and exploding gradient problems [53]. Very simply, these problems occur during the error back-propagation, specifically with RNNs, because of the recurrent connection (how f_θ uses h_{t-1}).

Figure 2.2: Structure of a standard LSTM
To address this issue, Hochreiter et al. [14] introduced a more sophisticated RNN model called long short-term memories (LSTM). Figure 2.2 shows a standard LSTM unit. In addition to the h_{t-1} recurrence, LSTM has one more recurrent connection called the memory, C. The yellow boxes are neural network layers. The first one from the left is called the forget gate, f. It takes x_t and h_{t-1} as input and outputs a vector of real numbers between 0 and 1 using the sigmoid function, σ. With this vector, LSTM determines what to keep and what to throw away from the recurrent memory C_{t-1}; 0 means completely forgetting everything, and 1 means keeping everything. The second and the third layers determine what new information will be stored in the current memory. The sigmoid layer (second yellow box from the left) is called the input gate, and it determines the values to be updated, i_t. The tanh layer computes the updates, C̃_t. Then, the new memory is computed by combining the remembered part of the old memory with the new updates via the addition operation following the tanh. The last σ layer computes the output value, which is merged with the updated memory through a tanh function to give the final output value h_t. Thanks to the forget gate and memory components, LSTM is capable of capturing long-term dependencies in sequential data, and it does not have the vanishing and exploding gradients problem.

Various versions of RNNs, including the LSTM, are commonly used in natural language processing (NLP) tasks such as predicting the upcoming word in typed text [54, 55] and machine translation [56], where the feature vectors h_t capture syntactic and semantic information about words based on their usage in context [14, 52]. Recently, researchers have begun to use LSTMs to solve software engineering tasks such as code completion [57] and code synthesis [58, 59]. To our knowledge, RNNs have not been used to classify analysis reports.

2.2.4 Graph Neural Networks

With RNNs, we can represent programs as a sequence of tokens. However, programs actually have a more complex structure that might be better represented with a graph. To better represent structure, we explore graph neural networks, which compute vector representations for nodes in a graph using information from neighboring nodes [60, 61].

The graphs are of the form G = ⟨N, E⟩, where N = {n_0, n_1, ..., n_i} is the set of nodes and E = {e_1, e_2, ..., e_j} is the set of edges. Each node n_i is represented with a vector h_i, which captures learned features of the node in the context of the graph. The edges are of the form e = ⟨type, source, dest⟩, where type is the type of the edge, and source and dest are the IDs of the source and destination nodes, respectively. The vectors h_i are computed iteratively, starting with arbitrary values at time t = 0 and incorporating information from the neighboring nodes NBR(n_i) at each time step t, i.e., h_i^(t) = f(n_i, h_NBR(n_i)^(t-1)).
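As a rough illustration of this iterative update, the sketch below runs a few rounds of synchronous message passing over a small graph, mixing each node's vector with the average of its neighbors' previous-step vectors. The averaging update is a deliberately simplified, hypothetical stand-in for the learned function f used by real GNNs (such as the gated graph neural networks studied in Chapter 4), and edge types are ignored here.

# Simplified message passing: h_i(t) is computed from h_NBR(i)(t-1).
def message_passing(num_nodes, edges, dim=4, steps=3):
    # edges: list of (type, source, dest); the type field is ignored in this toy version
    neighbors = {i: [] for i in range(num_nodes)}
    for _type, src, dst in edges:
        neighbors[dst].append(src)                    # messages flow along edge direction

    h = [[1.0] * dim for _ in range(num_nodes)]       # arbitrary initial vectors (t = 0)
    for _t in range(steps):
        new_h = []
        for i in range(num_nodes):
            if neighbors[i]:
                msg = [sum(h[j][d] for j in neighbors[i]) / len(neighbors[i])
                       for d in range(dim)]
            else:
                msg = [0.0] * dim
            # combine the node's own state with the aggregated neighbor message
            new_h.append([0.5 * h[i][d] + 0.5 * msg[d] for d in range(dim)])
        h = new_h
    return h   # per-node vectors that now reflect the surrounding graph structure

# usage: three nodes in a chain 0 -> 1 -> 2
vectors = message_passing(3, [("flow", 0, 1), ("flow", 1, 2)])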
Chapter 3: Learning a Classifier for False Positive Error Reports Emitted by Static Analysis Tools

Figure 3.1: Learning approach overview.

Now we explain our learning approach for reducing false positives of static analysis tools. Figure 3.1 depicts the approach with its two steps: preprocessing and learning. It takes as input the source code of the analyzed program and a list of error reports emitted by a static analysis tool for the program. For each error report, we start by reducing the source code to a subset of itself to isolate the code locations that are relevant to the error report. Next, we label the error reports by manually examining the source code. Lastly, we do supervised classification with the reduced code to discover code structures that are correlated with false positive error reports and to learn a classifier that can filter out false positive error reports emitted by the static analysis tool in the future.

3.1 Code Preprocessing

In initial work [62], we manually identified 14 core code patterns that lead to false positive error reports. We observed that in all code patterns, the root cause of the false error report often spans a small number of program locations. To better document these patterns, we performed manual code reduction to remove the parts of the code that are not relevant for the error report, i.e., the parts that do not affect the analysis result. After this manual reduction, the resulting program is effectively the smallest code snippet that still leads to the same false positive error report from the subject static analysis tool.

In this work, we develop approaches to automate the code reduction step. Such reduction is crucial because code segments that are not relevant to the error report may introduce noise, causing spurious correlations and over-fitting. Now, we explain the reduction techniques we apply in the case study described in Section 3.3: method body and program slicing.

Method body. As a naive approach, we simply take the body of the method that contains the warning line (referred to as the "warning method" later in the text). Note that many of the code locations relevant for the error report are not inside the body of the warning method. In many cases, the causes of the report span multiple methods and classes. Hence, this reduction is not a perfect way of isolating relevant code locations. However, if we can detect patterns in such sparse data, our models are likely to be on the right track.

Program slicing. Given a certain point in a program, program slicing is a technique that reduces that program to its minimal form, called a slice, whose execution still has the same behavior at that point [63]. The reduction is made by removing the code that does not affect the behavior at the given point. Computing the program slice from the warning line up to the entry point of a program would give us a backward slice, which (in theory) covers all code locations that are relevant for the error report. In practice, slicing can be expensive. We explain how we configured an industrial-scale framework, WALA [64], for computing the backward slice in more detail in Section 3.3.

3.2 Learning

Filtering false positive error reports can be viewed as a binary classification problem with the classes True Positive and False Positive. In this binary classification problem, we have two primary goals: (1) discovering code structures that are correlated with these classes, and (2) learning a classifier to detect false positive error reports (see Figure 3.1). Towards achieving these goals, we explore two different learning approaches. First, we use a simple Naive Bayes inference-based learning model. Second, we use a neural network-based language model called LSTM [14]. The first approach is simple and interpretable. The second approach is more sophisticated, and it can learn more complex patterns in the data.

Naive Bayesian Inference. We formulate the problem as calculating the probability that an error report is either a true positive or a false positive, given the analyzed code. So, the probability of the error report being a false positive is P(e=0 | code), where e=0 means there is no error in the code, i.e., the error report is a false positive. Since there are only two classes, the probability of being a true positive can be computed as P(e=1 | code) = 1 - P(e=0 | code). To calculate the probability P(e=0 | code), we use simple Bayesian inference:

    P(e=0 | code) = P(code | e=0) P(e=0) / P(code)
                  = P(code | e=0) P(e=0) / [P(code | e=0) P(e=0) + P(code | e=1) P(e=1)]

where P(e=0) and P(e=1) are respectively the percentages of the false positive and true positive populations in the dataset, and P(code) is the probability of getting this specific code from the unknown distribution of all codes. To calculate P(code | e=0) and P(code | e=1), we formulate the code as a sequence of instructions (bytecodes), i.e., code = ⟨I_1, I_2, I_3, ..., I_n⟩. So we rewrite P(code | e=0) as

    P(code | e=0) = P(I_1, I_2, ..., I_n | e=0)
                  = P(I_1 | e=0) P(I_2, ..., I_n | I_1, e=0)
                  = P(I_1 | e=0) P(I_2 | I_1, e=0) P(I_3, ..., I_n | I_1, I_2, e=0)
                  ...
                  = P(I_1 | e=0) P(I_2 | I_1, e=0) ... P(I_n | I_1, I_2, ..., e=0)

Algorithm 1 Computing Probabilities
1: for each code C in Dataset do
2:     for each instruction I in C do
3:         count[C.isTruePositive][I]++
4:         total[C.isTruePositive]++
5: for each instruction I do
6:     P(I | e=1) ← count[True][I] / total[True]
7:     P(I | e=0) ← count[False][I] / total[False]

To calculate each probability, we need to count the number of times each combination occurs in the dataset. However, for a complicated probability like P(I_n | I_1, I_2, ..., e=0), we would need a huge dataset to be able to estimate it accurately. To avoid this issue, we simplify this probability by assuming a Markov property. For this analysis, the Markov property means that the probability of seeing each instruction is conditionally independent (i.e., conditioned on e) of any other instruction in the code. Although this assumption is not likely to be true for flow-sensitive properties of code, it still helps us build an initial model to gain an intuition of what is happening in the dataset (our second model does not need this assumption). With the Markov property, the underlying probability becomes:

    P(code | e=0) = P(I_1 | e=0) P(I_2 | e=0) ... P(I_n | e=0)

Calculating each of the P(I_i | e=0) is very straightforward. We count the number of times instruction I_i appears in any false positive example, and we divide it by the total number of instructions in all of the false positive examples. Algorithm 1 shows how to calculate these probabilities. Line 3 counts the number of times each instruction appears in true positive and false positive examples. Then, line 4 counts the total number of instructions in each class. Finally, lines 6 and 7 compute the probabilities for the instructions.
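For illustration, the sketch below shows how the counts from Algorithm 1 could be turned into P(e=0 | code) for a new report, combining the per-instruction conditionals with the class priors in log space, as in the derivation above. The data structures and the add-one smoothing for unseen instructions are assumptions made for this example; it is not the exact implementation used in the case study.

import math

def train_counts(dataset):
    """dataset: list of (instructions, is_true_positive) pairs."""
    count = {True: {}, False: {}}
    total = {True: 0, False: 0}
    prior = {True: 0, False: 0}
    for instructions, is_tp in dataset:
        prior[is_tp] += 1
        for ins in instructions:
            count[is_tp][ins] = count[is_tp].get(ins, 0) + 1
            total[is_tp] += 1
    return count, total, prior

def prob_false_positive(instructions, count, total, prior):
    """P(e=0 | code) under the Markov (instruction-independence) assumption."""
    vocab = set(count[True]) | set(count[False])
    log_score = {}
    for cls in (True, False):    # True = true positive (e=1), False = false positive (e=0)
        s = math.log(prior[cls] / (prior[True] + prior[False]))
        for ins in instructions:
            # add-one smoothing so an unseen instruction does not zero out the product
            p = (count[cls].get(ins, 0) + 1) / (total[cls] + len(vocab) + 1)
            s += math.log(p)
        log_score[cls] = s
    # normalize the two joint scores into a posterior probability
    m = max(log_score.values())
    exp = {c: math.exp(v - m) for c, v in log_score.items()}
    return exp[False] / (exp[False] + exp[True])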
Long Short Term Memory (LSTM). Here, we took inspiration from the sentiment analysis problem in natural language processing. Sentiment analysis is commonly framed as a binary classification problem with the classes positive and negative, as we did for our problem in this study. To benefit from neural network models that have proven to be effective for the sentiment analysis problem, we convert programs into sequences of tokens. Then, for a given sequence (i.e., the reduced-code version of a program), we want to predict a label. We think LSTM is a good fit for this task because programs have long-term dependencies. For example, variable def-use pairs, method calls with arguments, and accesses to class fields are some of the program structures that form long-term dependencies in code. These dependencies are often relevant in deciding whether or not an error report is a false positive.

Carrier et al. [65] designed a single-layer LSTM model for the sentiment analysis problem. In this work, we adopt this simple LSTM model using the adadelta optimization algorithm [66]. To be able to make observations by visualizing the inner workings of the model, we prefer having fewer (four) cells, each of which is an LSTM (see Figure 3.4). Finding the optimum number of cells is not in the scope of this work. Following the LSTM layer, there is a mean pooling layer to compute a general representation of the input sequence. Finally, we do logistic regression to classify the data into one of the two classes. Figure 3.2 shows the structure of the LSTM unrolled over time, with n being the length of the longest sequence.

Figure 3.2: The LSTM model unrolled over time.
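For concreteness, the sketch below wires up a model of this shape in PyTorch: a token embedding, a four-unit LSTM, mean pooling over time, and a logistic-regression output trained with adadelta. The choice of framework, the embedding size, and the toy training step are assumptions made for this illustration; the study's actual model follows Carrier et al.'s design as described above.

# Illustrative sketch, not the exact model used in the case study:
# token embedding -> 4-unit LSTM -> mean pooling -> logistic regression
# over the two classes (false positive vs. true positive).
import torch
import torch.nn as nn

class FalsePositiveClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=32, hidden_dim=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 1)        # logistic regression on pooled features

    def forward(self, token_ids):                  # token_ids: (batch, sequence_length)
        h, _ = self.lstm(self.embed(token_ids))    # h: (batch, seq_len, hidden_dim)
        pooled = h.mean(dim=1)                     # mean pooling over time steps
        return torch.sigmoid(self.out(pooled)).squeeze(-1)   # P(report is a false positive)

model = FalsePositiveClassifier(vocab_size=1000)
optimizer = torch.optim.Adadelta(model.parameters())
loss_fn = nn.BCELoss()
# one training step on a toy batch of already-tokenized, padded sequences
tokens = torch.randint(0, 1000, (8, 50))       # 8 sequences of 50 token ids each
labels = torch.randint(0, 2, (8,)).float()     # 1 = false positive, 0 = true positive
loss = loss_fn(model(tokens), labels)
loss.backward()
optimizer.step()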
3.3 Case Study

This section presents the case study we conducted to evaluate the effectiveness of the proposed approach.

3.3.1 Subject Static Analysis Tool and Warning Type

In this case study, we focus on the SQL (Structured Query Language) injection flaw type. As the subject static analysis tool, we use the FindSecBugs plug-in of FindBugs [12, 67] (a widely used security checker for Java). This plug-in performs taint analysis to find SQL injection flaws. Very simply, taint analysis checks for data flow from untrusted sources to safety-critical sink points (see Appendix A for a more detailed description of taint analysis). For example, one safety-critical sink point for SQL injection is the Connection.execute(String) Java statement. A string parameter passed to this method is considered tainted if it comes from an untrusted source such as an HTTP cookie or a user input (both are untrusted sources because malicious users can leverage them to attack a system). FindSecBugs emits an SQL injection report in such cases to warn the user of a potential security problem.

However, for complex source code, it may be challenging to determine whether or not a given parameter is tainted. For example, programs may receive user input that becomes a part of an SQL query string (see Appendix A for an example). In such cases, the best practice is to perform security checks against an injection threat in the code. Such checks are often called sanitization or neutralization. When the chain of information flow becomes too complicated, the static analysis tool may not be able to track the sanitization correctly and might, therefore, emit a false positive error report.

Note that the proposed approach is not restricted to SQL injection flaws, FindSecBugs, or taint analysis. They are just the focus of this case study. Our learners do not make use of any information that is specific to these factors. Furthermore, in the next chapter, we will show that it is possible to train a model that works for multiple flaw types or static analysis tools (see Chapter 4).
1  public class BenchmarkTest16536 extends HttpServlet {
2    @Override
3    public void doPost(HttpServletRequest request){
4      String param = "";
5      Enumeration headers = request.getHeaders("foo");
6      if (headers.hasMoreElements()) {
7        param = headers.nextElement();
8      }
9      String bar = doSomething(param);
10     String sql = "{call verifyUserPassword('foo','" + bar + "')}";
11     Connection con = DatabaseHelper.getConnection();
12     con.prepareCall(sql).execute(); // A false positive SQLi warning
13   } // end doPost
14   private static String doSomething(String param){
15     String bar = "safe!";
16     HashMap map = new HashMap();
17     map.put("keyA", "a_Value");
18     map.put("keyB", param.toString());
19     map.put("keyC", "another_Value");
20     bar = (String) map.get("keyB");
21     bar = (String) map.get("keyA");
22     return bar;
23   } /* end doSomething*/
24 } /* end class*/

Figure 3.3: An example Owasp program that FindSecBugs generates a false positive error report for (simplified for presentation).

3.3.2 Data

One of the biggest challenges for our problem is to find a sufficient dataset on which to train. We know of no publicly available benchmark datasets containing real-world programs with labeled error reports emitted by static analysis tools. However, there are at least two benchmark suites developed to evaluate the performance of static analysis tools: Juliet [68] and the Owasp benchmark [13]. These benchmark suites consist of programs that exercise common weaknesses [69]. Note that not all programs in the benchmark suites really have an actual weakness. Roughly half of the programs are designed in certain ways that may trick static analysis tools into emitting false reports. For this case study, we focused on the Owasp benchmark suite as it has a bigger dataset for the SQL injection flaw type. This dataset has 2371 data points: 1193 false positive and 1178 true positive error reports.

Figure 3.3 shows an example Owasp program for which FindSecBugs generates an error report. At line 7, the param variable gets a value from an HTTP header element, which is considered to be a tainted source. The param variable is then passed to the doSomething method as an argument. In the doSomething method, starting at line 14, the tainted param argument is put into a HashMap object (line 18). Next, it is read back from the map into the bar variable at line 20. At this point, the bar variable has a tainted value. However, the program then gets a new value from the map, which is this time a hard-coded string, i.e., a trusted source. Finally, doSomething returns this hard-coded string, which gets concatenated into the sql variable at line 10. Then a callable statement is created and executed (lines 11 and 12). To summarize, the string concatenated with the SQL is hard-coded and thus does not represent an injection threat. Therefore, the error report is a false positive.

3.3.3 Preprocessing

For simplicity, we focus on the bytecode representation of the programs. With bytecode, there are fewer program-specific tokens and syntactic components than are found in source code, making it much easier for a classification model to work on. In contrast, in the source code, there might be multiple instructions in a single line, and what each instruction is doing is, therefore, less easy to understand. For the SQL injection dataset, we applied the two code reduction techniques (described in Section 3.1), leading to two different reduced datasets called "method body" and "backward slice", respectively.
Application of method body reduction is straightforward; we simply take the bytecode for the body of the warning method. Next, we describe the implementation details of the backward slice technique.

Tuning WALA. We use the WALA [64] program analysis framework for computing the backward slice with respect to a warning line. In theory, this slice should cover all code locations related to the error report. However, program slicing is unsolvable in general and not scalable most of the time [63]. In fact, we experienced excessive execution times when computing backward slices even for the simple, short Owasp programs. To avoid this problem, we configured the WALA program slicer to narrow the scope and limit the amount of analysis it does for computing the slice.

First, we restricted the set of data dependencies by ignoring exception objects, base pointers, and heap components. We assume that exception objects are not relevant to the error report. Base pointers and heap variables, on the other hand, are just represented as indexes in the bytecode, which our models cannot adequately handle, so we discarded them.

Second, we set the entry points as close to the warning method as possible. An entry point is usually where a program starts to execute. Therefore, it is the place where the backward slice should end. By default, this point would be the main method of the program. For the Owasp suite, however, there is a large amount of code in the main method that is common to all programs. Since this shared code is unlikely to be relevant to any error reports, we rule it out by setting the warning method as the entry point for Owasp.

Third, we exclude Owasp utility classes, Java classes, and all classes of third-party libraries, as none of them are relevant to the error report for this case study. With this exclusion, we are not removing the references to these classes. Instead, we are treating them as a black box. With the WALA tuning mentioned here, we are now able to compute a modified backward slice for Owasp programs in reasonable times.

Note that although WALA analyzes bytecode, the slice it outputs differs from bytecode in a few ways. For presentation purposes, WALA uses some additional instructions like new, branch, and return, which do not belong to Java bytecode (these instructions appear in the WALA IR, its intermediate representation). Therefore, the dictionaries of the method body dataset and the backward slice dataset are not the same.

Now, we explain the further changes we performed for both datasets. First of all, we removed program-specific tokens and literal expressions because they may give away whether the error report is a true positive or a false positive. For the LSTM Classifier, we do this by deleting literal expressions and replacing program-specific objects with UNK OBJ and method calls with UNK CALL. For the Naive Bayes Classifier, we do so by simply removing them all. Note that this step is also necessary to be able to generalize the classifier across programs. If we let the model learn from program-specific components, then it will not be able to do a good job on code that does not share the same ingredients.

Lastly, for the Naive Bayes Classifier, we remove all arguments to instructions except for the invoke instructions (invokeinterface, invokevirtual, etc.). With the invoke instructions, we also keep the names of the invoked classes. This is done to further simplify the dataset. Furthermore, we treat all kinds of invoke instructions as the same by simply replacing them with invoke. For the LSTM Classifier, we tokenized the data by whitespace; e.g., with the invoke instructions, the instruction itself is one token and the class being invoked is one token. Therefore, when analyzing results, we use the word "token" for LSTM and "instruction" for Naive Bayes.
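The sketch below illustrates this kind of preprocessing on WALA-style instruction strings: invoke variants are collapsed, literals are dropped, program-specific names are abstracted, and the result is tokenized by whitespace. The regular expression, the UNK_OBJ/UNK_CALL spellings, and the prefix-based notion of "program-specific" are simplifying assumptions made for this example, not the exact rules applied in the study.

import re

INVOKE_KINDS = ("invokeinterface", "invokevirtual", "invokespecial", "invokestatic")

def preprocess_instruction(instruction, program_prefixes=("org.owasp.benchmark",)):
    tokens = instruction.split()
    out = []
    for tok in tokens:
        if tok in INVOKE_KINDS:
            out.append("invoke")                       # treat all invoke kinds the same
        elif re.fullmatch(r'".*"|-?\d+(\.\d+)?', tok):
            continue                                   # delete string/number literals
        elif any(tok.startswith(p) for p in program_prefixes):
            # abstract program-specific classes and calls; library names are kept as-is
            out.append("UNK_CALL" if "(" in tok else "UNK_OBJ")
        else:
            out.append(tok)
    return out

# usage on a toy instruction sequence
slice_lines = [
    'invokevirtual javax.servlet.http.HttpServletRequest.getHeaders "foo"',
    "invokestatic org.owasp.benchmark.helpers.DatabaseHelper.getConnection()",
]
token_sequence = [t for line in slice_lines for t in preprocess_instruction(line)]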
Furthermore, we treat all kinds of invoke instructions as the same by simply replacing them with invoke. For the LSTM Classifier, we tokenized the data by whitespace; e.g., for the invoke instructions, the instruction itself is one token and the class being invoked is another token. Therefore, when analyzing results, we use the word "token" for LSTM and "instruction" for Naive Bayes.

3.4 Results and Analysis

For all experiments, we randomly split the dataset into an 80% training set and a 20% test set. Table 3.1 summarizes the results. Accuracy is the percentage of correctly classified samples. Recall is the percentage of correctly classified false positive samples among all false positive samples, and precision is the percentage of the samples classified as false positive that actually are false positives. All three metrics are computed using the test portion of the datasets.

                              Training       (%)
Classifier    dataset         Time (m)   Recall   Precision   Accuracy
Naive Bayes   method body     0.02       60       64          63
              backward slice  0.03       66       75          72
LSTM          method body     17         81.3     97.3        89.6
              backward slice  18         97       78.2        85

Table 3.1: Performance results for Naive Bayes and LSTM.

3.4.1 Naive Bayes Classifier Analysis

For the analysis of results, we consider any instruction I to be independent of the SQL injection flaw if the value

    P(I | e = 0) / ( P(I | e = 0) + P(I | e = 1) ) - 0.5

is smaller than 0.1 in magnitude. We call this value "False Positive Dependence"; it ranges from -0.5 to 0.5 inclusive, where large positive values mean the instruction is correlated with the false positive class. Large negative values mean it is correlated with the true positive class. Values around zero mean the instruction is equally likely to appear in both the true positive and false positive classes (i.e., P(I|e=0) is approximately P(I|e=1)) and is therefore independent of the SQL injection flaw.

We started the experiment by running the Naive Bayes Classifier on the method body dataset. Although the accuracy for this experiment is not very high (63%, in Table 3.1), it confirmed that the bytecode contains recognizable signals indicating false positive error reports. The Naive Bayes Classifier learns the conditional probability of each instruction, given that an error exists. We observed that instructions like iload, ifne, etc., are equally likely to appear in both true and false positive samples. Therefore, their False Positive Dependence value is below the threshold (0.1), and these instructions are independent of SQL injection flaws. Next, the only instruction we found to be correlated with the false positive class is invoke java.util.HashMap (Table 3.2). By manually examining the Owasp suite, we see that it is a common pattern to insert a tainted string and also a safe string into a HashMap, and then extract the safe string from the HashMap to become part of an SQL statement (see Figure 3.3 for an example). This is done to "trick" static analysis tools into emitting incorrect reports, and the Naive Bayes Classifier correctly identifies this situation.

For the second experiment, we did the same analysis for the backward slice dataset. We hypothesize that analyzing the code outside the method body improves the effectiveness of learning; therefore, we need to consider all relevant instructions, even if they are outside the method body. Backward slices provide this information by including all instructions in the program that may be relevant to the error report.
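To make the False Positive Dependence score concrete, the sketch below computes it from simple occurrence counts. The counting scheme (per-class presence counts with add-one smoothing) and the example numbers are assumptions for illustration, not the exact estimator or data used in the study.

// Hedged sketch: compute the False Positive Dependence of an instruction I
// from how many false positive (e = 0) and true positive (e = 1) samples
// contain I. Add-one smoothing is an assumption to avoid division by zero.
public class FalsePositiveDependence {

    // countFP / countTP: number of samples in each class containing the instruction.
    // totalFP / totalTP: number of samples in each class.
    public static double fpd(int countFP, int totalFP, int countTP, int totalTP) {
        double pGivenFP = (countFP + 1.0) / (totalFP + 2.0); // P(I | e = 0)
        double pGivenTP = (countTP + 1.0) / (totalTP + 2.0); // P(I | e = 1)
        return pGivenFP / (pGivenFP + pGivenTP) - 0.5;       // in [-0.5, 0.5]
    }

    public static void main(String[] args) {
        // Illustrative numbers only: an instruction seen in 300 of 950 false
        // positive slices and 60 of 940 true positive slices.
        double score = fpd(300, 950, 60, 940);
        System.out.printf("False Positive Dependence = %.2f%n", score);
        // |score| >= 0.1 would indicate the instruction is not independent
        // of the SQL injection flaw.
    }
}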
                               False Positive Dependence
Instruction                    Method body    Backward slice
invoke esapi.Encoder           -0.09          -0.36
invoke java.util.ArrayList      0.04           0.18
invoke java.util.HashMap        0.18           0.25

Table 3.2: Important instructions for classification.

Running the Naive Bayes model on the backward slice dataset confirms our findings on the method body dataset. In addition to HashMap, this model also learns from the backward slices that ArrayList invocation is highly correlated with false positives, and that Encoder invocation is highly correlated with true positives. The False Positive Dependence values for the significant invoked classes are shown in Table 3.2. These correlations can easily be justified by examining the code. In Owasp, ArrayList is used to trick the analyzer, much like HashMap in the previous discussion. Furthermore, Encoder is mainly used in the dataset for things like HTML encoding, not for SQL escaping. This pattern is used to trick the analyzer into missing some true positive samples, which our model identifies as well.

The main reason for the improved results on the backward slice dataset is that its data points include all relevant instructions, while the irrelevant instructions, which might act as noise, have been removed. This increases the confidence of the classifier. Table 3.2 shows that the dependence values have increased in magnitude for backward slices, which means the classifier is more confident about the effect of these instructions in the code.

Weaknesses. To better understand the limitations of the Bayes model, we examined some of the incorrectly classified examples. We observed that the model can still identify the influence of each instruction correctly. However, in those examples, multiple instructions are correlated with true positives while a single (or very few) instruction makes the string safe. By its nature, the Naive Bayes model cannot take into account that a single instruction is enough to make the string untainted. The instructions correlated with the true positive class weigh more when computing the overall probability, and the Naive Bayes model ends up classifying the code incorrectly.

                    Analysis precision (%)
Training dataset    initial    after filtering
method body         49.6       90.5
backward slice      49.6       98.0

Table 3.3: Precision improvements that can be achieved by using the LSTM models as a false positive filter.

3.4.2 LSTM Classifier Analysis

With the LSTM classifier, we achieved 89.6% and 85% accuracy for the method body and backward slice datasets, respectively (Table 3.1). The classifier trained on the method body dataset is very precise, i.e., 97.3% of the error reports classified as false positive are indeed false positives. However, it misses 18.7% of the false positives, classifying them as true positives. Using this classifier as a false positive report filter would significantly improve the tool's precision from 49.6% to 90.5% (the first row of Table 3.3).

The situation is reversed for the classifier trained on the backward slice dataset. It catches 97% of the false positives but also filters out many true positive reports, i.e., 21.8% of the samples classified as false positive are actually true positives. This translates to a greater improvement in the tool's precision, from 49.6% to 98% (the second row of Table 3.3).

We examined a sample program that was correctly classified by the classifier trained on the method body dataset, but incorrectly classified by the classifier trained on the backward slice dataset.
We observed that many instructions that only exist in the method body dataset, like aload_0, iconst_0, dup, etc., are found to be important by that classifier. There may be two reasons why these instructions are not in the backward slice dataset: either they do not have any effect on the warning line, or they were removed by the tuning we did for WALA. The first case, learning instructions that are not related to the warning line, would be over-fitting to noise. The second case, however, requires a more in-depth examination that we defer to future work (see Chapter 4). Relying on the first case, we think that the classifier trained on the backward slice dataset is more generalizable, as this dataset has less noise and more report-relevant components. Hence, in the rest of this section, we only analyze the classifier trained on the backward slice dataset.

Understanding the source of the LSTM's high performance is very challenging, as we cannot fully unfold its inner workings. Nevertheless, we can visualize the output values of some cells, as suggested by Karpathy et al. [70]. Figure 3.4 illustrates the output values of four cells for two correctly classified backward slices by coloring the background. The latter is the slice computed for the false positive sample in Figure 3.3. The former is the slice computed for a true positive, which is structurally similar to the latter with two critical differences: 1) the doSomething method is defined in an inner class, and 2) in the doSomething method, HTML encoder API methods are called instead of the HashMap operations.

If the LSTM model finds an input token important for the classification task, it will produce an output (i.e., the h_t vector) that is large in magnitude. The cyan, yellow, and white background colors in Figure 3.4 mean positive, negative, and under-threshold (within ±0.35) output values, respectively. There is only one shade of white, but for cyan and yellow, the darker the background, the larger the LSTM output (h_t) is in magnitude. Note that the last tokens are the labels: "truepositive" and "falsepositive".

Now, we discuss some interesting observations from Figure 3.4. First, due to the memory component of the LSTM, the background color of a token (the output for that token) does not depend solely on that token but is affected by the history. For example, looking at Cell 1, the first invokeinterface token has a yellow background color in both samples. Since there is no history before the first token, this yellowness is due solely to that token. On the other hand, looking at Cell 4, the first lines of both samples are mostly the same, except that there is one token with a cyan background color in the true positive sample: eq. The only difference in the first lines is the tainted source invoked, which is HttpServletRequest.getHeaderNames in the true positive and HttpServletRequest.getHeaders(String) in the false positive. Therefore, the only reason the eq token is interesting only in the true positive sample must be the invocation of the tainted source HttpServletRequest.getHeaderNames. This is a good example of a long-term dependency, as there are six other tokens in between.

Second, in both samples, all cells have a high output for the Enumeration.nextElement token, which is highly relevant to the error report as it is the tainted source. Note that all cells treat this token the same way in both samples. Similarly, all cells have a very high output for the last return instructions in both samples.
However, this time, the output is negative in the true positive sample and positive in the false positive sample in the first three cells (and vice versa in the last cell), which happens due to the history of tokens. This situation illustrates the LSTM's ability to infer the context in which these tokens appear.

Figure 3.4: LSTM color map for two correctly classified backward slices.

Next, looking at the false positive sample in Figure 3.3, we see that the core reason for its being a false positive resides in the body of the doSomething method. In particular, the HashMap put and get instructions are critical. All cells have a very high output for a subset of the tokens that correspond to those instructions. These high output values for the HashMap put and get instructions match the findings of the Bayesian model. Furthermore, all cells turn strongly yellow for the Encoder.encodeForHTML method call in the true positive sample. These high outputs for Encoder tokens are also consistent with the findings of the Bayesian model.

Lastly, Figure 3.4 shows that although most of the high output values are reasonable and interpretable, there are still many that we cannot explain. This situation is common with neural networks, and we will continue to explore it in future work.

3.4.3 Threats to Validity

Like any empirical study, our findings are subject to threats to internal and external validity. For this case study, we were primarily concerned with threats to external validity, since they limit our ability to generalize the results of our studies.

All of the significant threats to generalizability are related to the representativeness of the dataset used in the case study. First, OWASP programs are not truly representative of real-world programs. They are not large, and they do not implement any particular functionality other than exercising the targeted weakness. Nevertheless, they are still a good starting point for our problem. Second, the datasets only cover one type of flaw, SQL injection, reported by one static analysis tool, FindSecBugs. However, FindSecBugs is a plug-in for FindBugs, which we think is a good representative of open-source static analysis tools. Moreover, a manual review of the source code of FindSecBugs shows that it performs the same analysis for all the other types of security flaws it checks, such as command injection, LDAP injection, and cross-site scripting (XSS). Therefore, FindSecBugs and the SQL injection flaw are a good combination for representing security flaws and checkers. Lastly, we only experiment with Java bytecode. The next chapter presents additional work that addresses many of the threats discussed here (Chapter 4).

3.5 Attributions and Acknowledgments

The study presented in this chapter was sponsored by the Department of Homeland Security under grant #D15PC00169 and published in the proceedings of the 1st ACM SIGPLAN International Workshop on Machine Learning and Programming Languages [71]. I, Ugur Koc, came up with the idea of learning from source code and designed the neural network-based learning approach to realize the idea. I also designed and conducted the LSTM experiments and analyzed the results. The development, experimentation, and analysis tasks of the Naive Bayesian inference-based learning approach were carried out by the collaborator Parsa Saadatpanah.
39 Chapter 4: An Empirical Assessment of Machine Learning Approaches for Triaging Reports of a Java Static Analysis Tool In the previous chapter, we introduced a learning approach for classifying false positive error reports and evaluated its effectiveness with a case study of a static analysis tool. In this case study, we experimented with a Naive Bayesian inference model and a LSTM model, which could capture source code-level characteristics that may have led to false positives. This evaluation showed that LSTM significantly improved classification accuracy, compared to a Bayesian inference-based approach. Given the limited data set involved in the case study, however, further research is called for. Overall, while existing research suggests that there are benefits to applying machine learning algorithms to classify analysis results, there are important open research questions to be addressed before these techniques are likely to be routinely applied to this use case. First and foremost, there has been relatively little exten- sive empirical evaluation of different machine learning algorithms for false positive detection. Such empirical evaluation is of great practical importance for under- standing the tradeoffs and requirements of classification techniques. Second, the effectiveness and generalizability of the features used and data preparation tech- 40 niques needed for different classification techniques have not been well-investigated for actual usage scenarios. Third, there is also a need for more extensive, real-world program datasets to better validate the findings of prior work, which was primarily conducted on synthetic benchmarks. These open problems leave uncertainty as to which approaches to use and when to use them in practice. To partially address these limitations, in this chapter, we describe a system- atic, comparative study of multiple machine learning approaches for classifying static analysis results1. Our research makes several key contributions. First, we create a real-world, ground-truth program analysis benchmark for the experiments, consist- ing of 14 Java programs covering a wide range of application domains and 400 vulnerability reports from a widely used static analysis tool for Java (Section 4.2). Second, we introduce key data preparation routines for applying neural networks to this problem (Section 4.1). Third, we compare the effectiveness of four families of machine learning approaches: hand-engineered features, bag of words, recurrent neural networks, and graph neural networks, with different combinations of data preparation routines (Section 4.3). Our experimental results provide significant insights into the performance and applicability of the classification and data preparation techniques in two different real-world application scenarios we studied. First, we observe that the recurrent neural networks perform better compared to the other approaches. Second, with more precise data preparation, we achieved significant performance improvements 1The experimental infrastructure of this study (including all raw data) is available at: https: //bitbucket.org/ugur_koc/mangrove 41 over the state-of-the-art [71]. Furthermore, with the two application scenarios we studied, we demonstrated that the data preparation for neural networks has a signif- icant impact on the performance and generalizability of the approaches, and differ- ent data preparation techniques should be applied in different application scenarios (Section 4.4). 
4.1 Adapting Machine Learning Techniques to Classify False Posi- tives In this section, we discuss our adaptations of the four machine learning ap- proaches (discussed in Chapter 2) for false positive classification: hand-engineered features (HEF), bag of words (BoW), recurrent neural networks (RNN), and graph neural networks (GNN). 4.1.1 Hand-engineered Features In this work, we adapted the original feature set from Tripp et al. [10]. These features are: (1) source identifier (e.g., document.location), (2) sink identifier (e.g., window.open), (3) source line number, (4) sink line number, (5) source URL, (6) sink URL, (7) external objects (e.g., flash), (8) total results, (9) number of steps (flow milestones comprising the witness path), (10) analysis time, (11) number of path conditions, (12) number of functions, (13) rule name, (14) severity. From this list, we dropped source URL, sink URL, and external objects as they do not ap- pear in Java applications (our datasets consist of Java programs, Section 4.2). We 42 also extended it with two features extracted from static analysis reports that might improve the detection of false positives: confidence of the analyzer, and number of classes referred in the error trace (only relevant in object-oriented programming). We conjecture that longer error traces with references to many classes might indi- cate imprecision in the analysis, thus suggesting a higher chance of false positives. The confidence is nominal numeric value determined by the static analysis tool rep- resenting the internal robustness of the analysis conducted to generate the error report. Once the feature representations are defined, a wealth of classifiers f? and training algorithms can be used to learn how to make predictions. Since feature vectors ~x encode rich knowledge about the task, classifiers f? that compute sim- ple combinations of these features can be sufficient to train good models quickly. However, defining diverse features that capture all variations that might occur in different datasets is challenging and requires human expertise. Next, we explore how to represent program source code for more sophisti- cated machine learning approaches that can implicitly learn feature representations, starting with explaining how we summarized programs as a preprocessing step for them. 4.1.2 Program Slicing for Summarization BoW, LSTM, and GNN approaches work directly on the content of programs being analyzed, i.e. source code. Real-world programs are large (see Table 4.1), 43 and learning directly from such data is challenging since many sections of the code are unlikely to pertain to false positives and are likely to introduce noise for certain classification techniques. To address this difficulty, we computed the backward slice in the case study as shown in Chapter 3 [71]. The rationale for this choice is that the backward slice is a good summary of programs for the static analysis report classification task as it includes all the statements that may affect the behavior at the reported line. We use backward slice as well, as a pre-summarization step for BoW, RNN, and GNN approaches. In contrast to the case study in Chapter 3, in this work, we used Joana [72], a program analysis framework for Java, for computing backward slices. The first step is determining the entry points from which the program starts to run. 
For our problem, we first find the call hierarchy of the method containing the error line, then in this hierarchy, identify the methods that can be invoked by the user to set as the entry points. Such methods can be APIs if the program is a library, or the main method (which is the default entry point), or test cases. Next, we compute the program dependency graph (PDG), which consists of PDG nodes denoting the program locations that are reachable from the entry points. Then, we identify the PDG node(s) that appear in the reported source line. Finally, we compute the backward slice as a subgraph of the PDG by finding all the paths from the PDG nodes at the reported line to the entry point(s). Figure 4.1 shows an example PDG node. Line 1 shows the kind and ID of the node, which are EXPR and 164, respectively. At line 2, we see that the operation is a 44 1 EXPR 164 { 2 O reference; 3 V "v3 = com.mangrove.utils.DBHelper.conn"; 4 T "Ljava/sql/Connection"; 5 S "com/mangrove/utils/DBHelper.java":15,0; 6 DD 166; 7 CF 166; 8 ...} Figure 4.1: Sample PDG Node created with Joana program analysis framework (simplified for presentation) reference. At line 3, V denotes the value of the bytecode statement in WALA IR2. At line 4, T is the type of the statement (here, the Connection class in java.sql). Lastly, there is a list of outgoing dependency edges. DD and CF at lines 7 and 8 denote that this node has a data dependency edge and a control-flow edge, respectively, to the node with ID 166. In the following discussion, we will refer to these fields of PDG node. 4.1.3 Bag of Words In our experiments, we used two variations of BoW. The first variation checks the occurrence of words, which leads to a binary feature vector representation, where the features are the words. ?1? means that the corresponding word is in a program, and ?0? means it is not. The second variation counts the frequency of words, which leads to an integer feature vector, where each integer indicates how many times the corresponding word occurs. In our setting, ?words? correspond to tokens extracted from program slices using data preparation routines introduced in the next section 2Joana uses the intermediate representation from the T.J. Watson Libraries for Analysis (WALA) [64]. 45 (Section 4.1.4). Similar to the HEF approach, once the feature vector representations are cre- ated, any classification algorithm can be used for training. For a fixed classifier, training with BoW often takes longer than learning with HEF because the feature space (i.e., the dictionary) is usually significantly larger. 4.1.4 Recurrent Neural Networks In this study, we use the same LSTMs architecture that we used in the case study (see Figure 3.2) [71]. However, this time, we perform more precise data preparation with four sets of transformations. We denote each transformation as Tx for some x so we can refer to it later. We list the transformations in order of complexity, and a given transformation is applied only after applying all of the other, less sophisticated transformations. 4.1.4.1 Data Cleansing and Tokenization (Tcln) This set of transformations remove specific PDG nodes and perform basic to- kenization. First, they remove nodes of certain kinds (i.e., formal in, formal out, actual in, actual out), or whose value fields contain any of the phrases: many2many, UNIQ, ?init?, immutable, fake, exception , or whose class loader is Primordial, which means this class is not part of program?s source code. 
These nodes are re- moved because they do not provide useful information for learning. Some of them do not even inhibit anything from the actual content of programs, but instead they 46 are in the PDG to satisfy the static single assignment form3. For instance, the nodes with type NORM and operation compound do not have bytecode instructions from the program in their value field. Instead, their values contain the phrase many2many (en- coding many-to-many connections in the PDG), which is not useful for our learning task. Second, Tcln transformations extract tokens from paths of classes and methods by splitting them by delimiters like ?.?, ?,?, ?:?, and ?/?. 4.1.4.2 Abstracting Numbers and String Literals (Tans) These transformations replace numbers and string literals that appear in a program slice with abstract values. We hypothesize that these transformations will make learning more efficient by reducing the vocabulary of a given dataset and will help us to train more generalizable models. First, two-digit numbers are replaced with N2, three-digit numbers are with N3, and numbers with four or more digit are with N4+. We apply similar transfor- mations for negative numbers and numbers in scientific notation. Next, we extract the list of string literals and replace each of them with the token STR followed by a unique number. For example, the first string literal in the list will be replaced with STR 1. 3A property of the representation which requires that each variable is assigned exactly once, and every variable is defined before it is used [73]. 47 4.1.4.3 Abstracting Program-specific Words (Taps) Many programmers use a common, small set of words as identifiers, e.g., i, j, and counter are often used as integer variable identifiers. We expect that such commonplace identifiers are also helpful for our learning task. On the other hand, programmers might use identifiers that are program or domain-specific, and hence do not commonly appear in other programs. Learning these identifiers may not be useful for classifying static analysis reports in other programs. Therefore, Taps abstracts away certain words from the dataset that occur less than a certain amount of time, or that only occur in a single program, by replacing them with phrase UNK. Similar to Tans, these transformations are expected to improve the effectiveness and generalizability by reducing the vocabulary size via abstractions. 4.1.4.4 Extracting English Words From Identifiers (Text) Many identifiers are composed of multiple English words. For example, the getFilePath method from the Java standard library consists of three English words: get, File, and Path. To make our models more generalizable and to reduce the vocabulary size, we split any camelCase or snake case identifiers into their con- stituent words. Doing so, we increase the shared components of different programs as many of the extracted words are very commonplace in programming, like ?get? in the example above. Reducing rather less commonplace identifiers to such basic and common tokens will also help us learn more generalizable models. 48 Although in our case study [71], we have used transformations similar to Tcln, Tans, and Taps, to the best of our knowledge, the effects of these transformations have not been thoroughly studied before. This study partially addresses this knowl- edge gap with the systematic experiments we conducted with varying levels of data preparation (Section 4.3). 
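As an illustration of how transformations like Tans and Text could be realized, here is a hedged, self-contained sketch; the regular expressions, placeholder formats, and method names are assumptions for presentation rather than the implementation used in the study.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hedged sketch of two of the transformations described above: Tans (abstracting
// numbers and string literals) and Text (splitting camelCase/snake_case identifiers).
public class TokenTransforms {

    // Tans: map multi-digit numbers to N2/N3/N4+ buckets and string literals to STR_k.
    public static List<String> abstractNumbersAndStrings(List<String> tokens) {
        Map<String, String> stringIds = new LinkedHashMap<>();
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (t.matches("-?\\d{2}")) {
                out.add("N2");
            } else if (t.matches("-?\\d{3}")) {
                out.add("N3");
            } else if (t.matches("-?\\d{4,}")) {
                out.add("N4+");
            } else if (t.startsWith("\"") && t.endsWith("\"")) {
                // Give each distinct literal a stable placeholder: STR_1, STR_2, ...
                String id = stringIds.computeIfAbsent(t, k -> "STR_" + (stringIds.size() + 1));
                out.add(id);
            } else {
                out.add(t);
            }
        }
        return out;
    }

    // Text: split identifiers such as getFilePath or get_file_path into constituent words.
    public static List<String> splitIdentifier(String identifier) {
        List<String> words = new ArrayList<>();
        // Break on underscores and on lower-to-upper case transitions.
        for (String part : identifier.split("_|(?<=[a-z0-9])(?=[A-Z])")) {
            if (!part.isEmpty()) words.add(part);
        }
        return words;
    }
}

For example, splitIdentifier("getFilePath") would yield the three words get, File, and Path, as in the discussion above.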
Moreover, in this study, we further improved and ex- tended the transformations for mapping string literals and numbers with generic placeholders (e.g., STR 1, N1), splitting paths of classes and methods, and removing certain PDG nodes to improve the effectiveness and generalizability. 4.1.5 Graph Neural Networks In our study, we focus on a variation of GNNs called Gated Graph Neural Networks (GGNN) [74]. GGNNs have gated recurrent units and enable the initial- ization of the node representation. GGNNs have been used to learn properties about programs [74, 75], but have not been applied to classify static analysis reports or to learn from program slices. GGNNs require a fixed length vector to represent the graph nodes, i.e., the PDG node in our problem. PDG nodes, however, have arbitrary length informa- tion, mostly due to their ?Value? field, which holds the actual bytecode instruction. Therefore, creating a fixed length vector that captures all information available in the PDG node is a challenge. To adapt GGNNs to our problem, we introduce three approaches for initializing the input node representations ni. 49 4.1.5.1 Using Kind, Operation, and Type Fields (KOT) As the first representation, we only use the Kind, Operation, and Type fields of the PDG nodes. For the example node in Figure 4.1, the KOT node representation is Vrep =[EXPR, reference, Ljava/sql/Connection]. 4.1.5.2 Extracting a Single Item in Addition to KOT (KOTI) In the second representation, in addition to KOT, we include one more item that usually comes from the Value field of the PDG node depending on the Operation field. For example, if the operation is call, or entry, or exit, we extract the iden- tifier of the method that appears in the statement. The KOTI representation for the PDG node in Figure 4.1 is Vrep =[EXPR, reference, Ljava/sql/Connection, object] (object is the extracted item, meaning that the reference is for an object). 4.1.5.3 Node Encoding Using Embeddings (Enc) In the third representation, we use word embeddings to compute a vector representation that accounts for the entire bytecode in the Value field (which has an arbitrary number of words). To achieve this, we first perform pre-training using a bigger, unlabeled dataset to learn embedding vectors that capture more generic aspects of the words in the dictionary using the word2vec model [76, 77]. Then we take the average of the embedding vectors of the words that appear in the Value field as its representation, EV . Finally, we create a node vector by concatenating 50 EV with the embedding vectors of Kind, Operation, and type, i.e., Vrep = EK + EO + ET + EV where + is the concatenation operation. 4.2 Tool and Benchmarks Here, we explain the static analysis tool and corresponding ground-truth bench- marks we collected to empirically compare the effectiveness of the machine learning approaches. The static analysis tool we study is FindSecBugs [12] (version 1.4.6), a popular security checker for Java web applications. We configured FindSecBugs to detect cross-site scripting (XSS), path traversal (XPATH), and SQL, command, CRLF, and LDAP injections. We selected this subset of vulnerabilities because they share similarities in performing suspicious operations on safety-critical resources. Such operations are detected by the taint analysis implemented in FindSecBugs. We used two benchmarks in our evaluation. The first is the OWASP Bench- mark [13], which has been used to evaluate various static analysis tools in the liter- ature [78, 79]. 
In particular, we used the same programs that we used in our initial case study so that we could compare results; i.e., 2371 SQL injection vulnerability reports, 1193 of which are false positives; the remaining reports are true positives. We constructed the second benchmark, consisting of 14 real-world programs. We ran FindSecBugs on these programs, and then manually labeled the resulting 51 vulnerability reports as true or false positives. We chose these programs using the following criteria: ? We selected programs for which FindSecBugs generates vulnerability reports. To have the kinds of vulnerabilities we study, we observe that programs should perform database and LDAP queries, use network connections, read/write files, or execute commands. ? We chose programs that are open source because we need access to source code to apply our classification algorithms. ? We chose programs that are under active development and are highly used. ? Finally, we chose programs that are small to medium size, ranging from 5K to 1M lines of code (LoC). Restricting code size was necessary to successfully create the PDG, which is used for program slicing [72]. Table 4.1 shows the details of the programs we collected. Several programs have been used in past research: H2-DB and Hsqldb are from the Dacapo Bench- mark [94] (the other Dacapo programs did not satisfy our selection criteria), and FreeCS and UPM were used by Johnson et al. [95]. These 14 programs range from 6K to 916K LoC and cover a wide range of functionality (see the description column). In total, the 12 programs on GitHub have a large user base with 5363 watchers, 24 723 stars, and 10 561 forks on GitHub. The other two, Freecs and HSQDB, have 41K+ and 1M+ downloads, respectively, on sourceforge.net as of October 2018. 52 # of reports Program Description LoC TP FP Apollo-0.9.1 distributed config. management system [80] 915 602 4 6 BioJava-4.2.8 computational genomics framework[81] 184 040 26 32 FreeCS-1.2 chat server [82] 27 252 10 0 Giraph-1.1.0 graph processing system [83] 120 017 1 8 H2-DB-1.4.196 database engine [84] 235 522 17 30 HSQLDB-2.4.0 database engine [85] 366 902 43 15 Jackrabbit-2.15.7 content repository [86] 416 961 1 6 Jetty-9.4.8 web server with servlets [87] 650 663 12 4 Joda-Time-2.9.9 date and time framework [88] 277 230 2 3 JPF-8.0 symbolic execution tool [89] 119 186 15 27 MyBatis-3.4.5 persistence framework [90] 133 600 3 15 OkHttp-3.10.0 Android HTTP client [91] 60 774 10 2 UPM-1.14 password management [92] 6358 2 13 Susi.AI-07260c1 artificial intellegence API [93] 65 388 47 46 Total - 194 206 Table 4.1: Programs in the real-world benchmark. Running FindSecBugs on these programs resulted in more than 400 vulnerabil- ity reports. We then labeled the reports by manually reviewing the code4, resulting in 194 true and 206 false positives as ground-truth. To label a static analysis report, we first compute the backward call tree from the method that has the reported error line. Then we inspect the code in all callers until either we find a data-flow from an untrusted source (e.g., user input, HTTP request) without any sanity check? indicating a true positive?or we exhaust the call tree without identifying any tainted or unchecked data-flow?indicating a false positive. Through this review process, we observed that the false positives we found in the real-world benchmark were significantly different from those in the OWASP benchmark programs. 
The false positives of FindSecBugs usually happen due to 4Ugur Koc performed all of the labeling work, while other collaborators verified a random selection of labels [96]. 53 one of three scenarios: 1. the tool over-approximates and incorrectly finds an unrealizable flow; 2. the tool fails to recognize that a tainted value becomes untainted along a path, e.g., due to a sanitization routine; or 3. the source that the tool regards as tainted is actually not tainted. In the OWASP benchmark, false positives mostly stem from the first scenario. In our real-world benchmark, we mostly see only the second and third scenarios. This demonstrates that the OWASP programs do not represent the diversity in the real-world programs well enough. Therefore, creating a real-world benchmark for our study is essential to demonstrate the generalizability of our approaches. 4.3 Experimental Setup Applied preparations Approach name Occurrence feature vec. BoW-Occ Frequency feature vec. BoW-Freq Tcln LSTM-Raw Tcln + Tans LSTM-ANS Tcln + Tans + Taps LSTM-APS Tcln + Tans + Taps + Text LSTM-Ext Kind, operation, and type node vec. GGNN-KOT KOT + an extracted item GGNN-KOTI Node Encoding GGNN-Enc Table 4.2: BoW, LSTM, and GGNN approaches In this section, we discuss our experimental setup, including the variations of machine learning algorithms we compared, and how we divide datasets into training and test sets to mimic two different usage scenarios. 54 Variations of Machine Learning Algorithms. We compared the four families of machine learning approaches described in Section 4.1. For learning with HEF, we experimented with nine classification algorithms: Naive Bayes, Bayesian- Net, DecisionTree (J48), Random Forest, MultiLayerPerceptron (MLP), K*, OneR, ZeroR, and support vector machines, with the 15 features described in Section 4.1.1. We used the WEKA [97] implementations of these algorithms. For the other three families of approaches, we experimented with the variations described in Sections 4.1.3, 4.1.4, and 4.1.5. Table 4.2 lists these variations with their names and data preparation applied for them. For example, the approach LSTM- Raw uses Tcln transformations alone, while LSTM-Ext uses all four transformations. For BoW, we only used DecisionTree (J48) based on its good performance on HEF approaches. We adapted the LSTM implementation designed by Carrier et al. [65] and extended a GGNN implementation from Microsoft Research [98]. Application Scenarios. In practice, we envision two scenarios for using machine learning to classify false positives. First, developers might continuously run static analysis tools on the same set of programs as those programs evolve over time. For example, a group of developers might use static analysis as they develop their code. In this scenario, the models might learn signals that specifically appear in those programs, certain identifiers, API usage, etc. To mimic this scenario, we divide the OWASP and real-world benchmark randomly into training and test sets. In doing so, both training and test sets will have samples from each program in the dataset. We refer to the real-world random split dataset as RW-Rand for short. Second, developers might want to deploy the static analysis tool on a new 55 subject program. In this scenario, the training would be performed on one set of programs, and the learned model would be applied to another. 
To mimic this scenario, we divide the programs randomly so that a collection of programs forms the training set and the remaining ones form the test set. To our knowledge, this scenario has not been studied in the literature for the static analysis report classification problem. Note that the OWASP benchmark is not appropriate for the second scenario, as all the programs in the benchmark were developed by the same people and hence share many common properties, like variable names, length, API usage, etc. We refer to the real-world program-wise split dataset as RW-PW for short.

Training Configuration. Evaluating classification techniques requires separating data points into a training set, which is used to estimate model parameters, and a test set, which is used to evaluate classifier performance. For both scenarios, we performed 5-fold cross-validation, i.e., five random splits for the first scenario and five program-wise splits for the second scenario, dividing the dataset into five subsets and using four subsets for training and one subset for testing, for each of the 4-way combinations. Furthermore, we repeat each execution five times with different random seeds. The purpose of these many repetitions (5-fold cross-validation × 5 random seeds = 25 runs) is to evaluate whether the results are consistent (see RQ3).

LSTM and GGNN are trained using an iterative algorithm that requires users to provide a stopping criterion. We set a timeout of five hours, and we ended training if there was no accuracy improvement for 20 and 100 epochs, respectively. We made this choice because an LSTM epoch takes about five times longer to run than a GGNN epoch, making this threshold approximately the same in terms of clock time. For the LSTM, we conducted small-scale experiments of 15 epochs with the RW-Rand dataset and LSTM-Ext to determine the word embedding dimension for tokens and the batch size. We tested 4, 8, 12, 16, 20, and 50 for the word embedding dimension and 1, 2, 4, 8, 16, and 32 for the batch size. We observed that word embedding dimension 8 and batch size 8 led to the highest test accuracy on average, and thus we use these values in the remaining experiments. We also used this embedding dimension for the pre-training of GGNN-Enc.

Metrics. To evaluate the efficiency of the machine learning algorithms in terms of time, we use the training time and the number of epochs. After loading a learned model into memory, the time to test a data point is negligible (around a second) for all algorithms. To evaluate effectiveness, we use recall, precision, and accuracy, defined as follows:

    Precision (P) = (# of correctly classified true positives) / (# of samples classified as true positive)

    Recall (R) = (# of correctly classified true positives) / (# of all true positives)

    Accuracy (A) = (# of correctly classified samples) / (# of all samples, i.e., the size of the test set)

Accuracy is a good indicator of effectiveness for our study because there is no trivial way to achieve high accuracy when the samples are evenly distributed across the classes. Recall can be more useful when missing a true positive report is unacceptable (e.g., when analyzing safety-critical systems). Precision can be more useful when the cost of reviewing a false positive report is unacceptable. All three metrics are computed using the test portion of the datasets.

Research questions. With the above experimental setup, we conducted our study to answer the following research questions (RQ).

• RQ1 (overall performance comparison): Which family of approaches performs better overall?
• RQ2 (effect of data preparation): What is the effect of data preparation on performance?

• RQ3 (variability analysis): What is the variability in the results?

• RQ4 (further interpreting the results): How do the approaches differ in what they learn?

All experiments were carried out on a 64-bit Linux (version 3.10.0-693.17.1.el7) VM running on a 12-core Intel Xeon E312xx 2.4GHz (Sandy Bridge) processor with 262GB RAM.

4.4 Analysis of Results

In total, we trained 1350 static analysis report classification models: 3 datasets × 5 splits × 5 random seeds × (9 algorithms for HEF + 2 BoW variations + 4 data preparation routines for LSTM + 3 node representations for GGNN). The results are summarized in Tables 4.3 and 4.4 as the median and semi-interquartile range (SIQR) of 25 runs. We report the median and SIQR because we do not have any hypothesis about the underlying distribution of the data. Note that, for HEF, we list the four algorithms that had the best accuracy: K*, J48, RandomForest, and MLP. We now answer each RQ.

4.4.1 RQ1: Overall Performance Comparison

In this section, we analyze the overall performance of the four main learning approaches using the accuracy metric; the trends we mention here also hold for the recall and precision metrics. In Table 4.3, we separate the high-performing approaches from the others with a dashed line at points where there is a large gap in accuracy. Overall, the LSTM-based approaches outperform the other learning approaches in accuracy. The deep learning approaches (LSTM and GGNN) classify false positives more accurately than HEF and BoW, at the cost of longer training times. The gap between the LSTM and GGNN approaches and the others is larger in the second application scenario, suggesting that the learned hidden representations generalize across programs better than the HEF and BoW features. Next, we analyze the results for each dataset.

For the OWASP dataset, all LSTM approaches achieve above 98% for the recall, precision, and accuracy metrics. The BoW approaches are close, achieving about 97% accuracy. The HEF approaches, however, are all below the dashed line with under 80% accuracy. We conjecture that the features used by HEF do not adequately capture the symptoms of false (or true) positive reports (see Section 4.4.4). The GGNN variations show a big difference in accuracy: GGNN-Enc achieves 94%, while the other two variations achieve around 80% accuracy.
This suggests that for 59 Dataset Approach Recall Precision Accuracy LSTM-Raw 100.00 0 100.00 0 100.00 0 LSTM-ANS 99.15 0.74 98.74 0.42 99.37 0.42 LSTM-Ext 98.94 1.90 99.57 0.44 99.16 1.16 LSTM-APS 98.30 0.42 99.14 0.21 98.53 0.27 BoW-Occ 97.90 0.45 97.90 1.25 97.47 0.74 BoW-Freq 97.90 0.45 97.00 0.25 97.26 0.31 OWASP GGNN-Enc 92.00 5.00 94.00 5.25 94.00 1.60 HEF-J48 88.50 1.65 75.10 0.50 79.96 0.21 GGNN-KOTI 78.50 6.25 81.00 2.50 79.00 1.95 HEF-RandomForest 85.50 1.65 74.10 0.65 78.32 0.50 GGNN-KOT 80.00 3.25 77.50 2.00 78.00 0.95 HEF-K* 84.70 2.05 73.60 0.90 77.68 1.37 HEF-MLP 79.10 7.00 70.90 2.10 73.00 1.27 LSTM-Raw 90.62 2.09 86.49 3.52 89.33 2.19 LSTM-Ext 90.62 4.41 85.29 3.20 89.04 1.90 LSTM-APS 91.43 4.02 86.11 3.99 87.67 2.85 LSTM-ANS 89.29 2.86 84.21 3.97 87.67 1.59 BoW-Freq 86.10 2.30 87.90 1.85 87.14 1.85 BoW-Occ 84.40 4.45 87.50 3.85 85.53 2.45 RW-Rand GGNN-KOTI 83.00 4.50 84.00 3.50 84.21 1.55 HEF-K* 80.00 3.95 85.70 2.30 84.00 0.89 HEF-RandomForest 75.00 1.40 84.40 3.20 84.00 0.93 GGNN-KOT 89.00 7.00 80.00 7.00 83.56 3.48 GGNN-Enc 80.00 6.00 78.00 4.50 82.19 3.63 HEF-J48 78.10 2.15 82.40 0.90 81.33 0.92 HEF-MLP 71.40 2.80 86.20 6.10 81.33 1.97 LSTM-Ext 78.57 12.02 76.19 5.20 80.00 4.00 LSTM-APS 70.27 14.59 76.47 6.70 78.48 3.33 LSTM-ANS 62.16 25.58 75.76 7.02 74.68 3.85 LSTM-Raw 67.57 31.91 79.66 8.40 74.67 4.08 GGNN-Enc 77.00 36.00 75.00 19.50 74.67 5.89 GGNN-KOT 77.00 29.50 72.00 16.25 74.00 5.84 RW-PW HEF-MLP 58.10 14.65 70.40 9.40 73.08 7.76 GGNN-KOTI 65.00 33.50 75.00 11.00 72.02 5.12 HEF-K* 66.10 24.50 60.60 14.90 68.00 9.75 HEF-J48 60.70 11.65 72.70 12.80 65.33 8.04 HEF-RandomForest 62.50 24.30 60.30 5.55 63.44 2.67 BoW-Occ 50.00 12.90 65.00 22.30 51.32 4.61 BoW-Freq 47.80 16.50 65.70 14.70 51.25 8.55 Table 4.3: Recall, precision and accuracy results for the approaches in Table 4.2 and four most accurate algorithms for HEF, sorted by accuracy. The numbers in normal font are median of 25 runs, and numbers in smaller font semi-interquartile range (SIQR). The dashed-lines separate the approaches that have high accuracy from others at a point where there is a relatively large gap. 60 # of epochs Training time(min) LSTM-Raw 170 48 23 11 LSTM-ANS 221 47 32 4 LSTM-APS 237 35 31 4 OWASP LSTM-Ext 197 79 37 20 GGNN-KOT 303 113 28 10 GGNN-KOTI 218 62 20 6 GGNN-Enc 587 182 54 17 LSTM-Raw 62 1 303 1 LSTM-ANS 64 1 303 1 LSTM-APS 63 1 303 1 RW-Rand LSTM-Ext 50 0 304 2 GGNN-KOT 325 6 301 0 GGNN-KOTI 325 6 300 0 GGNN-Enc 326 4 300 0 LSTM-Raw 63 2 301 2 LSTM-ANS 65 2 301 6 LSTM-APS 65 2 302 2 RW-PW LSTM-Ext 52 2 303 2 GGNN-KOT 284 54 250 47 GGNN-KOTI 215 21 194 17 GGNN-Enc 245 50 211 58 Table 4.4: Number of epochs and training times for the LSTM and GGNN ap- proaches. Median and SQIR values as in Table 4.3 the OWASP dataset, the value of the PDG nodes, i.e., the textual content of the programs, carry useful signals to be learned during training. This also explains the outstanding performance of the BoW and LSTM approaches, as they mainly use this textual content in training. For the RW-Rand dataset, two LSTM approaches achieve close to 90% accu- racy, followed by BoW approaches at around 86%. GGNN and HEF approaches achieve around 80% accuracy. This result suggests that the RW-Rand dataset con- tains more relevant features the HEF approaches can take advantage of, and we conjecture that the overall accuracy of the other three algorithms dropped because 61 of the larger programs and vocabulary in this dataset. 
Table 4.5 shows the number of the words and length of samples for the LSTM approaches (the normal font is the maximum while the smaller font is the mean). As expected, the dictionary gets smaller while the samples get larger as we apply more data preparation. For GGNN, the number of nodes is 24 on average and 82 at most, the number of edges is 47 on average and 174 at most in the OWASP dataset. The real-world dataset has 1880 average to 16 479 maximum nodes, and 6411 average to 146 444 maximum edges. The real-world programs are significantly larger than the OWASP programs, both in dictionary sizes and sample lengths. For the RW-PW dataset, all the accuracy results except LSTM-Ext are below 80%. Recall that this split was created for the second application scenario where the training is performed using one set of programs, and testing is done using others. We observe the neural networks (i.e., LSTM and GGNN) still produce reasonable results, while the results of HEF and BoW dropped significantly. This suggests that neither the hand-engineered features nor the textual content of the programs are adequate for the second application scenario, without learning any structural information from the programs. Next, Both HEF and BoW approaches are very efficient. All their variations completed training in less than a minute for all datasets, while the LSTM and GGNN approaches run for hours for the RW-Rand and RW-PW datasets (Table 4.4). This is mainly due to the large number of parameters being optimized in the LSTM and GGNN. Lastly, note that the results on the OWASP dataset (Table 4.3) are directly 62 comparable with the results we achieved in the case study presented in the preivous chapter [71], which report 85% and 90% accuracy for program slice and control-flow graph representations, respectively. In this paper, we only experimented with pro- gram slices as they are a more precise summarization of the programs. With the same dataset, our LSTM-Ext approach, which does not learn from any program- specific tokens, achieves 99.57% accuracy. Therefore, we conjecture these improve- ments are due to the better and more precise data preparation routines we perform. 4.4.2 RQ2: Effect of Data Preparation Dictionary size Sample length Approach OWASP real-world OWASP real-world LSTM-Raw 333 13 237 735 224 156 393 18524 LSTM-ANS 284 9724 706 212 149 886 18104 LSTM-APS 284 9666 706 212 150 755 18378 LSTM-Ext 251 4730 925 277 190 950 23031 Table 4.5: Dataset stats for the LSTM approaches. For the sample length, numbers in the normal font are the maximum and in the smaller font are the mean. We now analyze the effect of different data preparation techniques for the machine learning approaches. Recall the goal of data preparation is to provide the most effective use of information that is available in the program context. We found that LSTM-Ext produced the overall best accuracy results across the three datasets. The different node representations of GGNN present tradeoffs, while the BoW variations produced similar results. Four code transformation routines were introduced for LSTM. LSTM-Raw achieves 100% accuracy on the OWASP dataset. This is because LSTM-Raw per- 63 forms only basic data cleansing and tokenization, with no abstraction for the vari- able, method, and class identifiers. Many programs in the OWASP benchmark have variables named ?safe,? ?unsafe,? ?tainted,? 
etc., giving away the answer to the classification task and causing memorizing the program-specific and concrete information from this dataset. Therefore, LSTM-Raw can be suitable for the first application scenario in which learning program-specific and concrete things can help learning. On the other hand, the RW-PW dataset benefits from more transformation routines that perform abstraction and word extraction. LSTM-Ext outperformed LSTM-Raw by 5.33% in accuracy for the RW-PW dataset. We presented three node representation techniques for GGNN. For the OWASP dataset, we observe a significant improvement in accuracy from 78% with GGNN- KOT to 94% with GGNN-Enc. This suggests that very basic structural informa- tion from the OWASP programs (i.e., the kind, operation, and type information included in GGNN-KOT ) carries limited signal about true and false positives, while the textual information included in GGNN-Enc carries more signal, leading to a large improvement. This trend, however, is not preserved on the real-world datasets. All GGNN variations, i.e., GGNN-KOT, GGNN-KOTI, and GGNN-Enc, performed similarly with 83.56%, 84.21%, and 82.19% accuracy, respectively, on the RW-Rand, and 74%, 72%, and 74.67% accuracy on the RW-PW datasets. Overall, we think the GGNN trends are not clear partly because of the nature of data such as sample lengths, dictionary, and dataset sizes (Tables 4.1 and 4.5). Moreover, the infor- mation encoded in the GGNN-KOT and GGNN-KOTI approaches is very limited, whereas the information encoded in GGNN-Enc might be too much (taking the av- 64 erage over the embeddings of all tokens that appear in the statement), making the signals harder to learn. BoW-Occ and BoW-Freq had similar accuracy in general. The largest differ- ence is 85.53% and 87.14% accuracy for BoW-Occ and BoW-Freq, respectively, on the RW-Rand dataset. This result suggests that checking the presence of a word is almost as useful as counting its occurrences. 4.4.3 RQ3: Variability Analysis In this section, we analyze the variance in the recall, precision, and accuracy results using the semi-interquartile range (SIQR) value given in the smaller font in Table 4.3. Note that, unlike other algorithms, J48 and K* deterministically produce the same models when trained on the same training set. The variance observed for J48 and K* is only due to the different splits of the same dataset. On the OWASP dataset, all approaches have little variance, except for a 7% SIQR for the recall value of HEF-MLP. On the RW-Rand dataset, SIQR values are relatively higher for all approaches but still under 4% for many of the high performing approaches. The BoW-Freq approach has the minimum variance for recall, precision, and accuracy. The LSTM- ANS and LSTM-Ext follow this minimum variance result. Last, the HEF-based approaches lead to the highest variance overall. On the RW-PW dataset, the variance is even bigger. For recall, in particular, 65 we observe SIQR values around 30% with some of the HEF, LSTM, and GGNN approaches. The best performing two LSTM approaches, LSTM-Ext and LSTM- APS, have less than 4% difference between quartiles in accuracy. We conjecture this is because the accuracy value directly relates to the loss function being optimized (minimized), while recall and precision are indirectly related. Lastly, applying more data preparation for LSTM leads to a smaller variance for all the three metrics for the PW-RW dataset. 
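Since the tables report the median and SIQR of 25 runs, the following sketch shows how these two summary statistics can be computed; the quartile convention and the example numbers are generic assumptions for illustration, not the analysis script or results of the study.

import java.util.Arrays;

// Hedged sketch: median and semi-interquartile range (SIQR) of a set of runs.
// SIQR is half the distance between the first and third quartiles.
public class RunSummary {

    public static double median(double[] values) {
        double[] v = values.clone();
        Arrays.sort(v);
        int n = v.length;
        return (n % 2 == 1) ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) / 2.0;
    }

    // Simple quartile convention: median of the lower/upper half of the sorted data.
    public static double siqr(double[] values) {
        double[] v = values.clone();
        Arrays.sort(v);
        int n = v.length;
        double q1 = median(Arrays.copyOfRange(v, 0, n / 2));
        double q3 = median(Arrays.copyOfRange(v, (n + 1) / 2, n));
        return (q3 - q1) / 2.0;
    }

    public static void main(String[] args) {
        // Illustrative accuracies from five hypothetical runs (not real results).
        double[] accuracies = {78.5, 80.0, 81.2, 79.0, 83.1};
        System.out.printf("median = %.2f, SIQR = %.2f%n",
                median(accuracies), siqr(accuracies));
    }
}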
4.4.4 RQ4: Further Interpreting the Results To draw more insights on the above results, we further analyze four represen- tative variations, one in each family of approaches. We chose HEF-J48, BoW-Freq, LSTM-Ext, and GGNN-KOT because these instances generally produce the best results in their family. Figure 4.2 shows Venn diagrams that illustrate the distri- bution of the correctly classified reports, for these approaches with their overlaps (intersections) and differences (as the mean for 5 models). For example, in Figure 4.2-A, the value 294 in the region covered by all four colors means these reports were correctly classified by all four approaches, while the value 1.8 in the blue only region mean these reports were correctly classified only by LSTM. The RW-Rand results in Figure 4.2-B show that 43 reports were correctly classified by all four approaches, meaning these reports have symptoms that are detectable by all approaches. On the other hand, 30.6 (41%) of the reports were misclassified by at least one approach. 66 A - The OWASP benchmark dataset: B - RW-Rand dataset: C - RW-PW dataset: Figure 4.2: Venn diagrams of the number of correctly classified examples for HEF- J48, BoW-Freq, LSTM-Ext, and GGNN-KOT approaches, average for 5 models trained. 67 The RW-PW results in Figure 4.2-C show that only 20 reports were correctly classified by all approaches. This is mostly due to the poor performance of the HEF-J48 and BoW-Freq. The LSTM-Ext and GGNN-KOT can correctly classify about ten more reports which were misclassified both by the HEF-J48 and BoW- Freq. This suggests that the LSTM-Ext and GGNN-KOT captured more generic signals that hold across programs. Last, the overall results in Figure 4.2 show that no single approach correctly classified a superset of any other approach, and therefore there is a potential for achieving better accuracy by combining multiple approaches. Figure 4.3-A shows a sample program from the OWASP dataset to demon- strate the potential advantage of the LSTM-Ext. At line 2, the param variable receives a value from request.getQueryString(). This value is tainted because it comes from the outside source HttpServletRequest. The switch block on lines 7 to 16 controls the value of the variable bar. Because switchTarget is as- signed B on line 4, bar always receives the value bob. On line 17, the variable sql is assigned to a string containing bar, and then used as a parameter in the statement.executeUpdate(sql) call on line 20. In this case, FindSecBugs overly approximates that the tainted value read into the param variable might reach the executeUpdate statement, which would be a potential SQL injection vulnerability, and thus generates a vulnerability warning. However, because bar always receives the safe value bob, this report is a false positive. Among the four approaches we discuss here, this report was correctly classified only by LSTM-Ext. To illustrate the reason, we show the different inputs of these 68 1 public void doPost(HttpServletRequest request...){ 2 String param = request.getQueryString(); 3 String sql, bar, guess = "ABC"; 4 char switchTarget = guess.charAt(1); // ?B? 5 // Assigns param to bar on conditions ?A? or ?C? 6 switch (switchTarget) { 7 case ?A?: 8 bar = param; break; 9 case ?B?: // always holds 10 bar = "bob"; break; 11 case ?C?: 12 bar = param; break; 13 default: 14 bar = "bob?s your uncle"; break; 15 } 16 sql = "UPDATE USERS SET PASSWORD=?" + bar + "? 
17   try {
18     java.sql.Statement statement = DatabaseHelper.getSqlStatement();
19     int count = statement.executeUpdate(sql);
20   } catch (java.sql.SQLException e) {
21     throw new ServletException(e);
22   }
23 }
(A)

1  org owasp benchmark UNK UNK do Post ( Http Servlet Request Http Servlet Response ) :
2  String VAR 6 = p 1 request get Query String ( ) :
3  C VAR 10 = STR 1 char At ( 1 ) :
4  switch VAR 10 : String Builder VAR 14 = new String Builder :
5  String Builder VAR 18 = VAR 14 append ( STR 0 ) :
6  String Builder VAR 20 = VAR 18 append ( VAR 13 ) :
7  String Builder VAR 23 = VAR 20 append ( STR 3 ) :
8  String VAR 25 = VAR 23 to String ( ) :
9  java sql Statement VAR 27 = get Sql Statement ( ) :
10 I VAR 29 = VAR 27 execute Update ( VAR 25 ) :
11 PHI VAR 13 = VAR 6 STR 4 VAR 6 STR 2
(B)

Figure 4.3: An example program (simplified) from the OWASP benchmark that was correctly classified only by LSTM-Ext (A), and the sequential representation used for LSTM-Ext (B).

Figure 4.3-B shows the sequential representation used by LSTM-Ext. HEF-J48 used the following feature vector:

[rule name: SQL INJECTION, sink line: 19, sink identifier: Statement.executeUpdate, source line: 2, source identifier: request.getQueryString, functions: 4, witness length: 2, number bugs: 1, conditions: 1, severity: 5, confidence: High, time: 2, classes involved: 1]

Notice that this feature vector does not include any information about the string variable guess, the switch block, or the overall logic that exists in the program. Instead, it relies on correlations that might exist for the features above. For this example, such correlations weigh more toward the true positive decision, thus leading to misclassification.

On the other hand, the LSTM-Ext representation includes the program information. For example, VAR 6 gets assigned to the return value of the request.getQueryString method, and VAR 10 is defined as STR 1 . char At ( 1 ). Note that STR 1 refers to the first string that appears in this program, i.e., "ABC". We see the tokens switch VAR 10 at line 4 corresponding to the switch statement. Then, we see string and SQL operations through lines 4 to 10, followed by a PHI instruction at line 11. This sequential representation helps LSTM-Ext to correctly classify the example as a false positive.

Last, BoW-Freq misclassified this example using the tokens in Figure 4.3-B without their order. This suggests that the overall correlation of the tokens that appear in this slice does not favor the false positive class. We argue that the correct classification by LSTM-Ext was not due to the presence of certain tokens, but rather due to the sequential structure.

4.4.5 Threats To Validity

There are several threats to the validity of our study. First, the benchmarks may not be representative. As we mentioned earlier, the OWASP benchmark is synthetic. Therefore, we collected the first real-world benchmark for classifying static analysis results, consisting of 14 programs, to increase the generalizability of our results. In addition, our real-world benchmark consists of 400 data points, which may not be large enough to train neural networks with high confidence. We repeated the experiments using different random seeds and data splits to analyze the variability that might be caused by having limited data. Second, we ran our experiments on a virtual machine, which may affect the training times.
However, since these models would be trained offline and very rarely, we are primarily interested in the effectiveness of the approaches in this study.

4.5 Attributions and Acknowledgments

The study presented in this chapter was supported by the Department of Homeland Security under grant #D15PC00169 and published in the proceedings of the 2019 12th IEEE Conference on Software Testing, Validation and Verification (ICST'19) [18]. All development and experimentation tasks were carried out by me, Ugur Koc, with the guidance and supervision of the other collaborators.

Chapter 5: Auto-tuning Configurable Program Verification Tools

Many static program verification tools can be customized by setting runtime configuration options. For example, CBMC [99], a popular verification tool for C/C++ programs, has 98 configuration options, such as --unwind, which specifies the number of times to unwind loops when computing the model of the program. Most of these options present tradeoffs between performance, precision, and soundness. Although these options allow users to customize the tools for their own needs, choosing the best options for a specific task is challenging, requiring domain expertise and extensive experimentation [19]. In addition, the effectiveness of setting an option (or a combination of options) can depend on the features present in the target program [24, 100], making it difficult to find a single configuration that produces desirable results (e.g., fast and precise) for differing programs. In practice, users, especially non-experts, often run a static bug detector on target programs with a provided "default" configuration to see if it produces desirable outputs. If not, these users often do not know how to effectively modify the analysis options to produce better results. We believe this challenge has prevented many verification tools from being used to their full potential.

Recent studies have shown that configuration options indeed present tradeoffs [19], especially when different program features are present [20, 21, 22]. Various techniques have been proposed that selectively apply a configuration option to certain programs or parts of a program (i.e., adaptive analysis), using heuristics defined manually or learned with machine learning techniques [20, 23, 21, 24, 22, 25]. Although a promising research direction, these techniques are currently focused on tuning limited kinds of analysis options (e.g., context-sensitivity). In addition, machine learning techniques have recently been used to improve the usability of static analysis tools. The applications include classifying, ranking, or prioritizing analysis results (like the learning approach we introduced and evaluated in Chapters 3 and 4) [9, 10, 26, 27, 7, 28, 29], and ranking program verification tools based on their likelihood of completing a given task [30, 31]. However, the configurability of static analysis tools, and of verification tools specifically, has not been considered in these applications. We believe that automatic configuration selection will make program verification tools more usable and enable their wide adoption in software engineering practice.

In this chapter, we present auto-tune, an approach to automatically configure program verification tools for given target programs. We aim to develop a generalizable approach that can be used for various tools that are implemented in and targeted at different programming languages.
We also aim to develop an efficient approach that can effectively search for desirable configurations in large configuration spaces. To achieve these goals, our approach leverages two main ideas. First, we use prediction models trained with supervised machine learning as both fitness functions and incorrect result filters. Our prediction models are trained on language-independent features of target programs and configuration options of the subject verification tools. Second, we use a meta-heuristic search algorithm that probabilistically scans the configuration spaces of verification tools using the aforementioned prediction models.

Overall, auto-tune works as follows: we first train two prediction models to be used in the meta-heuristic search algorithm using a ground-truth dataset that consists of correct, incorrect, and inconclusive1 program analysis runs. The first model, the fitness function, is trained on the entire dataset; the second model, the incorrect result filter (or, for short, filter), is trained only on the conclusive part of the dataset, i.e., we exclude the inconclusive analysis runs. Our search algorithm starts with a default configuration of the tool if available; otherwise, it starts with a random configuration. The algorithm then systematically, but non-deterministically, alters this configuration to generate a new configuration. Throughout the search, the fitness function is used to decide whether a configuration is good enough to accept as the current candidate solution, and the filter is used to decide whether to run the tool with this configuration or not. The algorithm continues to scan the search space by generating new configurations until it locates one that meets the thresholds in the fitness and filter functions and that leads to a conclusive result when run.

1 An inconclusive analysis run means the tool fails to come to a judgement due to a timeout, crash, or a similar reason.

auto-tune is a meta-reasoning approach [32, 33] because it aims to reason about how verification tools should reason about a given verification task. In this setting, the reasoning of a given verification tool is controlled by configuration options that enable/disable certain simplifications or assumptions throughout the analysis tasks. The ultimate goal of meta-reasoning is to identify a reasoning strategy, i.e., a configuration, that is likely to lead to the desired verification result.

We applied auto-tune to four popular program verification tools. CBMC and Symbiotic [34, 35] verify C/C++ programs, while JBMC [36] and JayHorn [37] verify Java programs. We generated program analysis datasets with ground truths from the SV-COMP2, an annual competition of software verification that includes a large set of both Java and C programs. We used our datasets, which contain between 55K and 300K data points, to train prediction models for each tool.

To evaluate the effectiveness of auto-tune, we consider two use cases. First, to simulate the scenario in which a non-expert uses a tool without a reliable default configuration, we start auto-tune with a random configuration. Our experiments suggest that auto-tune could produce results comparable to configurations manually and painstakingly selected by program analysis experts (referred to as comp-default3 in the remainder of this chapter). Second, to simulate the scenario in which a tool comes with a reliable default configuration, or is used by an expert, we start auto-tune with the comp-default configuration.

2 https://sv-comp.sosy-lab.org/2019
3 These are the configurations the tools used when participating in the competition.
Our results suggest that, with regard to the competition's scoring system, auto-tune could improve the SV-COMP performance for three out of the four program verification tools we studied: Symbiotic, JayHorn, and JBMC. For CBMC, auto-tune also increased the number of correct analysis runs significantly, but it did not improve the competition score due to the large penalty for incorrect results. We also studied the effects of different choices in our algorithm, including two supervised machine learning approaches (i.e., classification and regression) and three strategies for creating a new configuration in the search.

In summary, the contributions of our work are:

• A novel meta-heuristic search approach to automatically configure program verification tools, using machine learning models both as a fitness function and as an incorrect result filter.

• Successful applications of the approach to four state-of-the-art verification tools for C and Java (Section 5.4).

• The collection and analysis of ground-truth datasets, showing the distribution of results across different analysis configurations (Section 5.4).

• Empirical evaluations of different use-case scenarios that demonstrate the effectiveness of auto-tune, and an in-depth study of the choices in the algorithm (Section 5.5).

We have made the implementation, datasets, and evaluation results publicly available: https://bitbucket.org/ugur_koc/auto-tune

1  extern void __VERIFIER_error() __attribute__ ((__noreturn__));
2  void __VERIFIER_assert(int cond) {
3    if (!(cond)) { ERROR: __VERIFIER_error(); }
4  }
5
6  #define N 100000
7
8  int main() {
9    int min = 0, i = 0, a[N];
10   while (i < N) {
11     if (a[i] < min)
12       min = a[i];
13     i++;
14   }
15   for (int x = 0; x < N; x++) {
16     __VERIFIER_assert(a[x] >= min);
17   }
18   return 0;
19 }
A - Program P1

1  extern void __VERIFIER_error() __attribute__ ((__noreturn__));
2  void __VERIFIER_assert(int cond) {
3    if (!(cond)) { ERROR: __VERIFIER_error(); }
4  }
5
6  #define N 100000
7
8  int main() {
9    int src[N], dst[N], i = 0;
10   while (src[i] != 0) {
11     dst[i] = src[i++];
12   }
13   for (int x = 0; x < i; x++) {
14     __VERIFIER_assert(dst[x] == src[x]);
15   }
16   return 0;
17 }
B - Program P2

Figure 5.1: Code examples from SV-COMP 2018.

5.1 Motivating Examples

We now demonstrate the challenges involved in configuring verification tools using the two motivating examples in Figure 5.1. These two example programs are extracted from SV-COMP 2018 [101]. Both popular C verification tools we studied, CBMC and Symbiotic, produce inconclusive analysis results (i.e., a timeout in 15 minutes or a crash with an out-of-memory error) using their developer-tuned comp-default configurations on these programs. This leaves the tool users uncertain about whether they can successfully analyze these programs with other configurations.

Figure 5.1-A presents a safe program, P1. Its main functionality is to find the minimum value in the integer array a (lines 10-14). Line 16 uses an assertion to check if all the elements are greater than or equal to the computed minimum value. If the assertion fails, it triggers the ERROR on line 3. While, in fact, the ERROR cannot be reached in any execution of this program, CBMC's comp-default led to inconclusive results. To understand the difficulty of successfully analyzing this program, we manually investigated 402 configurations of CBMC.
Only 48 (12%) of them led to the correct analysis result, while the others were inconclusive. We identified that --depth=100, which limits the number of program steps, is the only option value that is common to the successful configurations and different from its value (no depth limit) in comp-default. Using this option value is critical when analyzing P1 because it improves the scalability of CBMC so that it finishes within the time limit. We made similar observations on the configurations of Symbiotic. Investigating the configurations that led to the correct result, 81 out of 222, we found that it is critical to turn the --explicit-symbolic4 option on for Symbiotic to scale on P1, rather than using comp-default, which turns it off. This example demonstrates that some options may be important for improving the performance of certain verification tools on certain programs and that altering them from the default configurations may yield better analysis results.

4 Setting --explicit-symbolic:on in Symbiotic results in initializing parts of memory with non-deterministic values. Otherwise, evaluation is done with symbolic values, which leads to tracking more execution paths (costly).

Figure 5.1-B shows an unsafe program, P2, with an array-out-of-bounds error. At lines 10-11, the program copies elements from the src array to the dst array in a loop until it reaches 0 in src. At lines 13-14, the assertion checks if src and dst have the same elements up to index i. The problem with this program is that if src does not contain 0, the array accesses at line 10 will exceed the bounds of the arrays. Recall that Symbiotic's comp-default also led to an inconclusive result on P2. From the manual investigation of P1, we know that turning on the option --explicit-symbolic may be critical to the performance of the tool. However, doing this leads to incorrect results on P2 due to unsoundness, because the random values used for initialization may actually contain 0. Out of the 137 Symbiotic configurations we manually investigated for P2, only 3 led to the correct analysis result, while 123 were inconclusive and 12 were incorrect. In the 12 configurations leading to the incorrect result, --explicit-symbolic was always on.

The above motivating examples show that there may not exist a single configuration under which a tool performs well across all programs, because the options interact differently with programs depending on the programs' features. However, such manual investigation is costly and requires domain expertise, demonstrating the need for auto-tune, whose goal is to automatically locate tool configurations that are likely to produce desired analysis results for a given program.

5.2 Our Auto-tuning Approach

[Figure 5.2: Workflow of our auto-tuning approach.]

One way to perform auto-tuning is to train classifiers to predict the appropriate settings for all configuration options (a configuration consisting of many analysis options), given a target program to analyze. This can be achieved with multi-target classification [102] by treating each configuration option as a label. However, as our motivating examples show, only a few options may be strongly associated with certain program features. This means that in order to achieve high accuracy in a multi-target learning model, the ground-truth dataset should have many data points for these options, while the replications of the other options represent noise.
Without knowing in advance which analysis options are important, prohibitively large datasets would be required in order to contain the necessary replications of each configuration option. Alternatively, the amount of data needed sharply decreases if one is able to frame the problem as a single-target problem instead of a multi-target problem. Our idea for auto-tune is to formulate the problem as a search problem that uses models trained with single-target classification/regression.

Figure 5.2 shows the workflow of auto-tune. The key component is a meta-heuristic configuration search that navigates through a tool's configuration space and makes predictions about candidate configurations using machine learning models trained offline. To use auto-tune, a user provides a target program and an optional initial configuration. auto-tune runs the tool with the initial configuration if provided. If the initial configuration leads to an inconclusive result, or if the user does not provide an initial configuration (e.g., a good default configuration of a tool is not available), auto-tune explores the configuration space to locate a configuration likely to produce conclusive and correct analysis results for the target program (using thresholds and prediction models). The located configuration is then used to run the verification tool. The search algorithm is iteratively applied until a conclusive analysis result is produced, or the search terminates after being unable to find a conclusive result. This workflow applies to automatically configuring many configurable program verification tools.

Algorithm 2 Meta-heuristic configuration search algorithm
 1: function ConfigSearch(CS = ⟨O, D⟩, P, M, F, θ)
 2:   T ← 1; Ts ← 10^-5; R ← 10^-4               ▷ hyper-parameters
 3:   analysisResult ← -999                       ▷ analysis exit code
 4:   V ← getProgramRepresentation(P)
 5:   c* ← c ← getInitialConfig(CS)
 6:   Ec* ← Ec ← query(M, ⟨V + c⟩)                ▷ cost for ⟨V + c⟩
 7:   while Ts < T ∧ ¬isComplete(analysisResult) do
 8:     repeat
 9:       c' ← generateNeighboringConfig(CS, c)
10:       Ec' ← query(M, ⟨V + c'⟩)
11:       ΔE ← Ec' - Ec
12:       T ← T - (T * R)
13:     until ΔE < 0 ∨ rand(0, 1) < e^(-kΔE/T)
14:     c, Ec ← c', Ec'
15:     if Ec < Ec* then
16:       c*, Ec* ← c, Ec                         ▷ best config so far
17:       Sc ← query(F, ⟨V + c⟩)                  ▷ incorrectness score for ⟨V + c⟩
18:       if Sc ≤ θ then
19:         analysisResult ← runSA(P, c)
20:   return ⟨analysisResult, c*⟩

5.2.1 Meta-heuristic Configuration Search

We now present our meta-heuristic configuration search algorithm, which effectively explores and makes predictions about configurations from a subset of a tool's configuration space. Our algorithm is inspired by simulated annealing [103, 104, 105]. Simulated annealing is a meta-heuristic search algorithm that mimics the metallurgical process of annealing. It probabilistically scans the search space to approximate the globally optimal solution. Generally, simulated annealing starts from an initial state and iteratively searches for a good solution by applying alterations to the current state. This generates a new candidate state (i.e., a neighboring state). If the generated neighboring state is better than the current state, it is "accepted" as the current state for the next search iteration. Otherwise, to avoid getting stuck in local optima, "inferior" neighboring states are probabilistically accepted. There are three control parameters in simulated annealing: the initial temperature T0, the cooling rate R by which the temperature T is reduced every iteration, and the stopping temperature Ts. At any given iteration t, the temperature Tt determines the probability that an inferior neighbor state is accepted as the new current state. This has the effect of reducing the probability that inferior neighboring states are accepted as the search progresses; i.e., inferior neighboring states are more likely to be accepted at early iterations. The search ends either when a good solution is found, or when the temperature reaches Ts.
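To make the acceptance rule concrete before walking through Algorithm 2, the following minimal Java sketch shows a generic simulated-annealing loop. It is an illustration only, not the auto-tune implementation: costOf and neighborOf are assumed placeholders for the fitness-function query and neighbor generation, and the constant k from the acceptance probability is folded into the scale of the cost values.

import java.util.Random;

// Illustrative simulated-annealing loop (not the auto-tune implementation).
public class AnnealingSketch {
    static final double STOP_TEMPERATURE = 1e-5; // Ts
    static final double COOLING_RATE = 1e-4;     // R
    static final Random RAND = new Random();

    static int[] search(int[] initialState) {
        double temperature = 1.0;                 // T0
        int[] current = initialState;
        double currentCost = costOf(current);
        while (temperature > STOP_TEMPERATURE) {
            int[] neighbor = neighborOf(current);
            double deltaE = costOf(neighbor) - currentCost;
            temperature -= temperature * COOLING_RATE;  // cool down every iteration
            // Always accept better states; accept inferior states with a probability
            // that shrinks as the temperature decreases.
            if (deltaE < 0 || RAND.nextDouble() < Math.exp(-deltaE / temperature)) {
                current = neighbor;
                currentCost += deltaE;
            }
        }
        return current;
    }

    // Problem-specific placeholders (assumptions for illustration only).
    static double costOf(int[] state) { return 0.0; }
    static int[] neighborOf(int[] state) { return state.clone(); }
}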
Algorithm 2 shows our meta-heuristic configuration search. The inputs to this algorithm are:

1. the tool's configuration space CS = ⟨O, D⟩, where O is the set of configuration options and D is their domains,
2. the target program P,
3. the fitness function model M,
4. the filter model F, and
5. the threshold value θ for the incorrectness score computed by the filter.

Lines 2-6 perform the initializations. First, we initialize the hyper-parameters as follows: T = 1, Ts = 10^-5, and R = 10^-4 (empirically determined). The variable analysisResult is set to -999, referring to an inconclusive analysis result. Line 4 extracts the program features of P and assigns them to V. At line 5, the current configuration (c) and the best configuration (c*) are both initialized with either a default configuration provided by the user or a randomly generated configuration. At line 6, M is used to compute Ec, the predicted cost of this configuration running on a program with features V. Depending on whether M is configured as a classification or regression model (see Section 5.2.2), this cost can be a continuous score (regression) or the probability that c will produce either an inconclusive or incorrect result (classification).

Lines 7 to 19 are the search iterations. Line 7 has two guards: Ts < T, to check that the temperature has not decreased below the stopping temperature, and isComplete(analysisResult), which stops the search when a conclusive result has been produced (since our goal is only to find a "good" rather than "the optimal" solution). The inner loop at lines 8-13 generates the neighboring configuration c' (line 9), predicts its cost using M (line 10), and repeats until one of the acceptance conditions on line 13 is met. There are two reasons a c' can be accepted as the current configuration c: either it is better than c (ΔE < 0), or the probabilistic condition indicating whether to accept an inferior state, rand(0, 1) < e^(-kΔE/T), holds (line 13). The current solution c and current cost Ec are then updated according to the generated configuration at line 14. If the fitness model predicts that the current configuration is the best so far (Ec < Ec*), c* and Ec* are also updated accordingly (lines 15-16). When a new configuration is determined to be the best, we want to run the analysis tool and update analysisResult accordingly. However, running the analysis tool is an expensive operation. We thus use a second prediction model F on line 17, the filter, to predict an "incorrectness score", in other words, the likelihood that running the tool with the configuration will return an incorrect analysis result. We predict this score Sc and only run the analysis tool if Sc does not exceed the threshold θ (lines 18-19). If the analysis produces conclusive results, the search ends, and we return the analysis results (line 20). Otherwise, the search continues.

To summarize, our search algorithm adapts simulated annealing with three key enhancements specific to our application:
1. The algorithm does not require finding the optimal solution but rather a good solution, which is implemented through the additional stopping condition, isComplete(analysisResult).

2. The cost estimated by the fitness function model, Ec, is not just the cost of a configuration c, but the cost of the configuration and program pair ⟨V + c⟩.

3. The filter model and the control mechanism that uses this model to decide whether to run the analysis tool or not (lines 17-19).

5.2.2 Learning the Fitness Function and Filter

Data Structure. A data point in our dataset is precisely what is used to query the fitness function in the search process; i.e., a pair ⟨V + c, X⟩, where V is the program feature vector, + is the concatenation operator, c is the configuration, and X is the analysis result (i.e., correct, incorrect, or inconclusive). This way, when predicting an analysis result for a new program and configuration pair, the machine learning algorithms will not just use program features or configuration options, but also their interactions.

Feature Extraction. Recall that one of our goals is to be able to apply our approach to tools targeted at analyzing programs in different languages. Therefore, we extract generic program features that are available in most (if not all) programming languages. We use a simple bag of words technique that has been used in the analysis report classification study (presented in Chapter 4) [18]. We only count the number of each kind of intermediate-level instruction in the three-address code, and the number of loops. In particular, we use 56 LLVM IR [106] instructions for C and 33 WALA IR [64] instructions for Java.

Prediction Models. To navigate the search space, we need a way to measure the cost/fitness of each candidate solution. We achieve this through a fitness function. The fitness function should output continuous numerical cost values for candidates. In this problem, we do not know what the exact costs of configurations are; therefore, we approximate this cost by using a prediction model as a fitness function. We achieve this in two alternative ways. One way is to train classifiers and use them to compute the probability that a data point (V + c) results in either an inconclusive or incorrect analysis result. This probability is used as the cost/fitness of the data point.

Alternatively, we can map the classes of analysis results to certain numeric values and train regression models. The regression model's output is then used as the cost. In this case, since our search is designed as a minimization problem, i.e., aiming to reduce the cost, the desired analysis results should be mapped to values smaller than the values of the undesired (i.e., incorrect and inconclusive) analysis results. One advantage of the regression approach is customizability; i.e., the distance between the correct and incorrect classes can be adjusted for different scenarios. In certain application domains, missing a buggy program (i.e., a false negative, FN) is not acceptable. To account for that, we can assign a very high cost value to the FN class and a relatively low cost value to the FP class. This way, our approach can tolerate FP results while still aiming to avoid missing any buggy/defective programs. In other application domains, FN results might be acceptable, while wasting developers' time and other resources on FP results is not. Accordingly, the cost value for FP can be set high, and the cost value for FN relatively low.
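To make this cost mapping concrete, the following minimal Java sketch shows one way the analysis-result classes could be mapped to numeric costs. The default values follow the mapping we use in our evaluation (Section 5.5.1); the safety-critical variant is a hypothetical re-weighting for illustration, not a setting we evaluated.

import java.util.EnumMap;
import java.util.Map;

// Illustrative cost mapping for the regression-based fitness function.
public class CostMapping {
    enum Result { TRUE_NEGATIVE, TRUE_POSITIVE, INCONCLUSIVE, FALSE_POSITIVE, FALSE_NEGATIVE }

    static Map<Result, Double> defaultCosts() {
        Map<Result, Double> costs = new EnumMap<>(Result.class);
        costs.put(Result.TRUE_NEGATIVE, 0.0);    // desired outcomes get low cost
        costs.put(Result.TRUE_POSITIVE, 1.0);
        costs.put(Result.INCONCLUSIVE, 50.0);    // undesired outcomes get high cost
        costs.put(Result.FALSE_POSITIVE, 99.0);
        costs.put(Result.FALSE_NEGATIVE, 100.0);
        return costs;
    }

    // Hypothetical re-weighting for a domain where missing a bug (FN) is unacceptable:
    // FN is pushed far above FP, so the search tolerates FPs but avoids FNs.
    static Map<Result, Double> safetyCriticalCosts() {
        Map<Result, Double> costs = defaultCosts();
        costs.put(Result.FALSE_POSITIVE, 60.0);
        costs.put(Result.FALSE_NEGATIVE, 500.0);
        return costs;
    }
}

With such a mapping, the regression threshold discussed in Section 5.3 can be interpreted directly in these cost units.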
We explore both of these approaches, i.e., classification and regression, in our study. We provide and discuss the cost values we used in our evaluations with regression models in Section 5.5.1.

Once we have candidate configurations, our next goal is to reduce the likelihood of selecting a final configuration that produces an incorrect result. We thus utilize a second prediction model, the incorrect result filter, that is used in the search to compute the likelihood of a data point (V + c) corresponding to an incorrect analysis result. We call the output of this filter an incorrectness score. We train the filter in the same way as the fitness function, using classification or regression. In fact, the only difference between the fitness function and the filter is the dataset they are trained on. The fitness function is trained on the entire dataset, so, ideally, it can learn the relations between the data and analysis results, including the inconclusive ones. The filter, however, is trained on the conclusive analysis runs only, in order to make it more specialized in learning the program features, configuration options, and their potential interactions that might lead to incorrect analysis results.

To obtain the ground-truth datasets for a given verification tool, we systematically select sample configurations from its configuration space and run the tool with each selected configuration on a set of benchmark programs with known correctness properties (i.e., unsafe or safe). After generating the labeled dataset, we train the two models to be used in the configuration search, as described in Section 5.2.1.

5.2.3 Neighboring Configuration Generation

The neighbor generation strategy of any search algorithm is an essential component that can significantly affect performance. A good neighbor generation strategy should select good candidates that direct the search to scan the space in a cost-effective manner. One common way to generate neighboring states is to alter the current state randomly. We have three different neighbor generation strategies.

First, as the baseline, we generate a neighboring configuration by altering the current configuration at a maximum of three random points with no restrictions. These alterations take place randomly within the domain of each option, which is defined manually. We call this strategy the base strategy.

In order to come up with more sophisticated strategies, we performed a main effect screening study [107] on the ground-truth datasets. We observed that there are configuration option values that have statistically significant effects on the number of correct, incorrect, and/or inconclusive analysis results. We implement two alternative neighbor generation strategies that make use of these findings.

The second strategy is the greedy strategy. For this strategy, when the current configuration is altered to produce the neighboring configuration, we forbid any option value oi,j (referring to the jth value of option i) that increases the number of incorrect results significantly more than it increases the number of correct results. However, if an option value contributes to both the correct and incorrect classes equally, we do not forbid it. We expect the fitness function to learn the interactions between such option settings and program features and to favor the setting that is likely to lead to the correct analysis result for a given program.

The third strategy is the conservative strategy. In this strategy, any oi,j that increases the number of incorrect results is forbidden, regardless of whether it also increases the number of correct results.

To identify the forbidden option values for the greedy and conservative strategies, we performed main effect screening studies on the ground-truth datasets using JMP [108]. Any option value that increases incorrect results and has a statistically significant effect (p < 0.05) according to the screening study is forbidden during neighbor generation if the appropriate rule applies.
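The following Java sketch illustrates this neighbor-generation step. It is a simplified illustration with assumed data structures (an option's domain as a list of values, and a per-strategy set of forbidden option values), not the tool's actual code.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;

// Illustrative neighbor generation: alter the current configuration at up to three
// random options, choosing replacement values from each option's domain and skipping
// values forbidden by the chosen strategy (for the base strategy the set is empty).
public class NeighborSketch {
    static final Random RAND = new Random();

    static Map<String, String> neighborOf(Map<String, String> current,
                                          Map<String, List<String>> domains,
                                          Set<String> forbidden) { // entries like "option=value"
        Map<String, String> neighbor = new HashMap<>(current);
        List<String> options = new ArrayList<>(domains.keySet());
        int changes = 1 + RAND.nextInt(3);                 // at most three alteration points
        for (int i = 0; i < changes; i++) {
            String option = options.get(RAND.nextInt(options.size()));
            List<String> domain = domains.get(option);
            String value = domain.get(RAND.nextInt(domain.size()));
            if (forbidden.contains(option + "=" + value)) {
                continue;                                  // skip values forbidden by the strategy
            }
            neighbor.put(option, value);
        }
        return neighbor;
    }
}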
5.3 Implementation

We implemented the proposed approach in 600 lines of Java code.5 Specifically, we used the random forest algorithm6 for classification [110] and the REP (Reduced-Error Pruning) tree algorithm7 for regression [111], both with default parameters. In this implementation, supervised training of the predictive models and the search are automated. It takes as input:

1. the training and testing dataset files;
2. an executable script to run the verification tool;
3. the choice of starting point (default or random config);
4. the choice of neighbor generation strategy (base, greedy, or conservative);
5. the choice of learning model (regression or classification); and
6. the threshold value (there are other optional inputs we do not list here; see the --help option of the tool).

5 Weka Dev 3.9.3 [109] is the only framework that we have as a dependency for the machine learning algorithms.
6 Implementation: weka.classifiers.trees.RandomForest
7 Implementation: weka.classifiers.trees.REPTree

Note first that if the starting point is chosen as the default config (the default config file must be present), the search will start only if the default config fails to return a conclusive analysis result. Note also that the threshold values are on different scales for classification and regression. For classification, the threshold should be between 0 and 1, where 0 leads to very conservative runs that filter out any configuration that is not guaranteed to complete with a correct result, while 1 means no threshold. For regression, this value will depend on the numeric value mappings of the classes. For example, if the correct classes are mapped to 0 and the incorrect classes are mapped to 100, a threshold value of 10 will filter out any configuration that leads to an incorrectness score greater than 10.
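For readers unfamiliar with Weka, the sketch below shows how such models can be trained and queried. It is a minimal example under assumptions: the dataset is an ARFF file with the class attribute last, the file name is a placeholder, and in the real setup the classifier and regressor would be trained on nominal-class and numeric-class versions of the data, respectively.

import weka.classifiers.trees.REPTree;
import weka.classifiers.trees.RandomForest;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Minimal Weka sketch: train a RandomForest (classification) and a REPTree (regression)
// on <program features + configuration> vectors and query them for one data point.
// "dataset.arff" is a placeholder file name.
public class WekaSketch {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("dataset.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // label/cost is the last attribute

        RandomForest fitnessClassifier = new RandomForest();
        fitnessClassifier.buildClassifier(data);      // default parameters
        // (in practice, the class attribute is nominal for the classifier
        //  and numeric for the regressor)
        REPTree fitnessRegressor = new REPTree();
        fitnessRegressor.buildClassifier(data);

        Instance candidate = data.instance(0);        // stands in for a new <V + c> vector
        double[] classProbabilities = fitnessClassifier.distributionForInstance(candidate);
        double predictedCost = fitnessRegressor.classifyInstance(candidate);

        System.out.println("class probabilities: " + java.util.Arrays.toString(classProbabilities));
        System.out.println("predicted cost: " + predictedCost);
    }
}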
5.4 Datasets

In this section, we describe the subject analysis tools we used in our evaluation, the benchmark programs, and the ground-truth dataset we created for each subject tool.

5.4.1 Subject Tools

Tool              Target lang.   # of options   Config. space size   Sample size   Dataset size
CBMC 5.11         C              21             2.9 × 10^9           295           295,000
Symbiotic 6.1.0   C              16             9.8 × 10^5           82            54,940
JayHorn 0.6-a     Java           12             7.5 × 10^6           256           94,208
JBMC 5.10         Java           27             7.2 × 10^10          200           73,600

Table 5.1: Subject verification tools.

Table 5.1 lists the subject verification tools that we used in our study. We chose these tools because they all participated in SV-COMP 2019. Therefore, we can collect the ground-truth performance using the scripts provided with the competition infrastructure. To demonstrate the generality of our approach, we chose two tools that verify C programs, CBMC and Symbiotic, and two tools that verify Java programs, JBMC and JayHorn. All four tools have configuration options for customization. In our evaluation, we focus on the options that affect the analysis performance, soundness, and/or precision, instead of the ones for formatting output and turning on/off specific checkers. The third and fourth columns in Table 5.1 show the number of options we use and the number of possible configurations that can be created with them, respectively.

5.4.2 Benchmark Programs

All of our benchmark programs are from the SV-COMP. For the Java tools, we used all 368 benchmark programs from SV-COMP 2019. All of the Java benchmarks are written with assertions, and the verification tools check if these assertions always hold. Among the 368 programs, 204 (55.4%) are unsafe. For the C tools, we randomly selected 1000 out of 9523 programs from SV-COMP 2018.8 For C programs, there are five different properties to check: concurrency safety, memory safety, overflow, reachability, and termination. In our sample set, there are 335 programs for concurrency safety9, 51 for memory safety, 65 for overflow, 485 for reachability, and 130 for termination. Among the 1000 programs, 517 (51.7%) are unsafe.

8 SV-COMP 2019 data were not available when we started this research. The benchmark set for C is mostly the same between SV-COMP 2018 and 2019.
9 Symbiotic opted out of the concurrency safety category in SV-COMP. Accordingly, we did not run the tool on these tasks.

Tool        Dataset size   UNK       TN       TP       FN       FP
CBMC        295,000        171,913   25,556   56,310   21,882   19,339
Symbiotic   54,940         36,551    5,099    11,300   1,966    24
JayHorn     94,208         62,383    17,435   9,916    166      4,308
JBMC        73,600         45,866    12,117   5,399    9,138    1,080

Table 5.2: Data distribution in the ground-truth datasets (aggregated).

5.4.3 Ground-truth Datasets

Because it is infeasible to test the tools' performance using all configurations, we use sample sets of the configurations to collect the ground-truth datasets. Existing research has shown that combinatorial testing using covering arrays results in representative sample sets with good coverage of the configuration space [112, 113]. We therefore create a 3-way covering array for each tool, which is a list of configurations that includes all 3-way combinations of configuration options [112]. The fifth column in Table 5.1 shows the number of configurations we used as a sample for generating the ground truths of each tool. We run each configuration of a tool with a 60-second timeout on each benchmark program to create the ground-truth datasets. We used a shorter timeout than the competition due to the limited resources we have. Also, preliminary experiments with one tool, CBMC, showed that the distribution of correct, incorrect, and inconclusive analysis results did not change much with a 300-second timeout, and the tasks that complete usually do so within the first minute. The sizes of the datasets range from 54,940 (Symbiotic) to 295,000 (CBMC) (last column in Table 5.1).
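To make the data-collection protocol concrete, the sketch below shows how a single ground-truth run could be labeled. The command line, verdict parsing, and file layout are assumptions for illustration; they are not the actual SV-COMP scripts we used.

import java.util.concurrent.TimeUnit;

// Illustrative labeling of one ground-truth run: execute a verification tool with one
// sampled configuration on one benchmark and compare its verdict to the known property.
public class GroundTruthSketch {
    enum Label { TP, TN, FP, FN, UNK }

    static Label runOnce(String toolCmd, String config, String program, boolean knownUnsafe)
            throws Exception {
        Process p = new ProcessBuilder("sh", "-c", toolCmd + " " + config + " " + program)
                .redirectErrorStream(true).start();
        boolean finished = p.waitFor(60, TimeUnit.SECONDS);   // 60-second timeout per run
        if (!finished) {
            p.destroyForcibly();
            return Label.UNK;                                  // timeout -> inconclusive
        }
        String verdict = parseVerdict(p);                      // "safe", "unsafe", or "unknown"
        if (verdict.equals("unknown")) return Label.UNK;       // crash or no judgement
        if (knownUnsafe) return verdict.equals("unsafe") ? Label.TP : Label.FN;
        else             return verdict.equals("safe")   ? Label.TN : Label.FP;
    }

    static String parseVerdict(Process p) { return "unknown"; } // placeholder parser
}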
Figure 5.3 and Table 5.2 show the distribution of analysis results in each dataset.

[Figure 5.3: Distribution of analysis results for each sample configuration.]

Each stacked bar in Figure 5.3 corresponds to the number of correct, incorrect, or inconclusive analysis results obtained with a configuration. Moreover, the configurations are ordered by the number of verification tasks they completed with the correct result. Note that there are many inconclusive verification runs (patterned grey bars) for all tools. Respectively, 58%, 67%, 66%, and 62% of the verification runs are inconclusive for CBMC, Symbiotic, JayHorn, and JBMC (third column of Table 5.2). Perhaps surprisingly, incorrect results (orange bars) are common in these verification tools too. Respectively, 33%, 11%, 14%, and 37% of the conclusive verification runs are incorrect for CBMC, Symbiotic, JayHorn, and JBMC (sixth and seventh columns of Table 5.2).

In Figure 5.3, we observe large variances in the behaviors of different configurations. CBMC, JayHorn, and JBMC all have a number of configurations that do not produce any conclusive results, while there exist some configurations in each tool that can produce a few hundred correct results. The Symbiotic results have lower variance compared to the others, but its most effective configuration (the rightmost bar in the Symbiotic plot in Figure 5.3) still produces 136 more correct results than the least effective one. Considering precision, CBMC, JayHorn, and JBMC also have configurations that produce conclusive results for almost all the programs at the cost of many incorrect results, demonstrating the significant soundness and performance tradeoffs that the configuration options of these tools control.

Figure 5.3 also shows the comp-default configuration results of each tool as the thicker bar with an arrow pointing to it. We observe that all the comp-default configurations lead to few incorrect results, while the comp-defaults of CBMC, Symbiotic, and JayHorn do not lead to the largest number of correct results. Specifically, only 1, 2, 2, and 0 incorrect results but 573, 446, 180, and 37 inconclusive results were produced by the comp-default configurations of CBMC, Symbiotic, JayHorn, and JBMC, respectively. This result shows that even though the comp-default configurations have been manually tuned to reduce the number of incorrect results, there is still considerable room for improvement in correctly verifying more programs.

5.5 Evaluation

5.5.1 Experimental Setup

Research Questions: We experiment with two use-case scenarios. The first scenario is when a non-expert user does not know a reliable default configuration with which to start. To simulate this scenario, we start our search with a random configuration (with replications). We compare our search results with each tool's comp-default results. Recall that the comp-default of each tool was manually tuned by program analysis experts for the verification tasks; therefore, this experiment answers the following research question:

• RQ1: How close can auto-tune get to the expert knowledge starting from random configurations?

The second scenario is when there is a reliable default configuration provided to the user. Although often precise, the default configuration may not be conclusive on many verification tasks. To simulate this scenario, we start our search with each tool's comp-default only when that configuration is inconclusive. We evaluate auto-tune's results using the scoring schema from the SV-COMP [101]. This experiment answers the following research question:

• RQ2: Can auto-tune improve on top of expert knowledge?

Next, we conduct an in-depth investigation to explore how our approach performs using the different neighbor generation strategies described in Section 5.2.3 and the different machine learning approaches described in Section 5.2.2, to answer the following research questions:

• RQ3: How do different neighbor generation strategies affect auto-tune's performance?

• RQ4: How do different machine learning techniques (i.e., classification and regression) affect auto-tune's performance?
Numerical Value Mapping for Analysis Results: For the regression approach, we map the analysis results to the following numeric values: true negative: 0, true positive: 1, inconclusive: 50, false positive: 99, and false negative: 100. This mapping will also enable us to compare the training performance of both approaches (Section 5.5.5).

Metrics: We used three metrics in our evaluation: (i) the number of correct analysis results, (ii) precision, as the percentage of correct results among all conclusive results, and (iii) the SV-COMP score.

Cross-validation: We perform 5-fold cross-validation by splitting the benchmark programs into five sets and using four sets for training and one set for testing, rotating the testing set so that each set is used for testing once.

Threshold: We experimented with five threshold values: θ = {0.1, 0.2, 0.3, 0.4, 0.5} for classification and θ = {5, 10, 15, 20, 40} for regression.

Environment: All of our experiments were conducted on two servers. One server has 24 Intel Xeon Silver 4116 CPUs @ 2.10GHz and 144GiB RAM, and the other has 48 Intel Xeon Silver 4116 CPUs @ 2.10GHz and 192GiB RAM. Both servers run Ubuntu 16.04 LTS.

Search Runs: In total, we run auto-tune for 72,030 verification tasks; that is, 3 neighbor generation strategies × 2 machine learning techniques × 5 threshold values × a total of 2401 programs (1000 and 665 C programs for CBMC and Symbiotic, respectively, and 368 programs each for JayHorn and JBMC).

5.5.2 RQ1: How close can auto-tune get to the expert knowledge?

Table 5.3 compares the auto-tune results for the first use-case scenario with the comp-default results. In this scenario, the search starts with a random configuration, which can potentially have very bad performance (see Figure 5.3). Recall that we presented multiple auto-tune settings for neighbor generation and predictive model training that may present precision, soundness, and performance tradeoffs. Here we analyze results and make observations about two settings: S1, base-classification with θ = 0.1, which produces overall the most precise results, and S2, base-classification with θ = 0.4, which produces the highest number of correct results.10

10 Figures 5.4 and 5.5 present the results for all experiments.

                 Tool        # correct   # incorrect   Precision
comp-default     CBMC        426         1             99.76
                 Symbiotic   217         2             99.08
                 JayHorn     176         2             98.86
                 JBMC        331         0             100
auto-tune:S1     CBMC        427         29            93.75
                 Symbiotic   256         6             97.70
                 JayHorn     276         23            92.30
                 JBMC        264         34            88.59
auto-tune:S2     CBMC        742         227           76.57
                 Symbiotic   248         9             96.49
                 JayHorn     290         43            87.08
                 JBMC        292         42            87.42

Table 5.3: The number of correct and incorrect results and the computed precision for two auto-tune settings, named S1 and S2, with the base neighbor generation strategy and classification models; θ = 0.1 for S1 and θ = 0.4 for S2.

First, we observe that auto-tune:S1 can produce a comparable number of correct results to comp-default with high precision. For CBMC, auto-tune:S1 and comp-default produce nearly the same number of correct results, while auto-tune:S1 has 28 more incorrect results, i.e., 94% precision. For Symbiotic, auto-tune:S1 produced 49 more correct results than its comp-default configuration while still maintaining a good precision of 98%. Similarly, for JayHorn, auto-tune:S1 produced 103 more correct results with 92% precision. For JBMC, however, auto-tune:S1 produced 67 fewer correct results than its comp-default, with a precision of 89%.

Second, we find that auto-tune:S2 can produce more correct results than comp-default at the cost of some precision loss for three of the subject tools.
CBMC, Symbiotic, and JayHorn all outperform comp-default in terms of the number of correct results, with 316, 38, and 114 additional correct results and 76.57%, 96.49%, and 87.42% precision, respectively.

We acknowledge that JBMC's comp-default, as the first-place winner of SV-COMP'19, already has good performance, with only 37 inconclusive results and no incorrect results. Figure 5.3 shows that, in contrast to the other tools, JBMC's comp-default is actually the best-performing configuration among the ones we used in our experiments.

Finally, we show that auto-tune significantly outperforms the median results of the configurations in the datasets. auto-tune:S1 outperforms the dataset median by 153, 61, 176, and 194 more correct results for CBMC, Symbiotic, JayHorn, and JBMC, respectively. The dataset median precision for these tools is 70.54, 94.76, 92.25, and 47.64 (respectively). auto-tune:S1 also outperforms these median values, with 93.75%, 97.70%, 92.30%, and 88.59% precision. This result suggests that auto-tune can potentially improve over many configurations in the configuration space. Overall, we believe auto-tune can significantly improve the analysis outcomes over many initial configurations, producing similar or more correct results than comp-default at the cost of some precision.

5.5.3 RQ2: Can auto-tune improve on top of expert knowledge?

Figures 5.4 and 5.5 show the number of completed tasks (y-axis) for varying threshold θ values (x-axis) for the second use-case scenario; i.e., the search runs only if comp-default cannot complete. These figures include the results of all the auto-tune settings we experimented with, for comparison. Each stacked bar shows the auto-tune results for a specific setting and tool. The blue and orange portions represent the correct and incorrect results, respectively. The numbers on top of the bars represent the difference between the score auto-tune would have achieved and the scores that the tools achieved in the competition (we compare to the scores from the SV-COMP edition from which we obtained the respective benchmarks, i.e., SV-COMP'18 for CBMC and Symbiotic and SV-COMP'19 for JBMC and JayHorn).

[Figure 5.4: auto-tune improvements with classification models, as the number of conclusive analysis tasks (y-axis) for varying threshold values (x-axis), with one panel per tool (CBMC, Symbiotic, JayHorn, JBMC) and one column per neighbor generation strategy (base, greedy, conservative). The search runs only if comp-default cannot complete. Each stacked bar shows the distribution of results (TN, TP, FN, FP) for each neighbor generation strategy. The number on top of each bar is the difference between auto-tune's score and the comp-default configuration's score.]

[Figure 5.5: auto-tune improvements with regression models (same layout as Figure 5.4, with the regression thresholds on the x-axis).]

For example, the leftmost bar in the bottom-left panel of Figure 5.4 is for JBMC and auto-tune:base-classification with θ = 0.1. This run has 12 correct results and 8 incorrect results, leading to a 107-point decrease in the SV-COMP'19 score over comp-default.

For three out of the four verification tools, i.e., Symbiotic, JayHorn, and JBMC, auto-tune led to improvements in the competition score in some settings, with no additional incorrect results. The competition is scored as follows: 2 points for the verification of a safe program (i.e., a true negative, TN), 1 point for finding a bug in an unsafe program (i.e., a true positive, TP), 0 points for an inconclusive analysis run (i.e., unknown, UNK), -16 points for finding a non-existing bug in a safe program (i.e., a false positive, FP), and -32 points for wrongly verifying an unsafe program as safe (i.e., a false negative, FN).
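The small helper below restates this scoring schema in code (our own illustration, not part of the SV-COMP tooling) and shows why incorrect results dominate the score.

// Illustrative encoding of the SV-COMP scoring schema described above.
public class SvCompScore {
    enum Outcome { TRUE_NEGATIVE, TRUE_POSITIVE, UNKNOWN, FALSE_POSITIVE, FALSE_NEGATIVE }

    static int points(Outcome o) {
        switch (o) {
            case TRUE_NEGATIVE:  return 2;    // correctly verified safe program
            case TRUE_POSITIVE:  return 1;    // correctly found bug in unsafe program
            case UNKNOWN:        return 0;    // inconclusive run
            case FALSE_POSITIVE: return -16;  // reported a non-existing bug
            case FALSE_NEGATIVE: return -32;  // claimed an unsafe program is safe
            default:             return 0;
        }
    }

    public static void main(String[] args) {
        // One false negative wipes out the gain of 32 correctly found bugs.
        System.out.println(32 * points(Outcome.TRUE_POSITIVE)
                + points(Outcome.FALSE_NEGATIVE));   // prints 0
    }
}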
auto-tune improves upon the scores in all Symbiotic runs from SV-COMP with a maximum improvement of 79 points, in one JayHorn run with an improvement of 3 points, and in ten JBMC runs with a maximum improvement of 8 points. Recall that all of these improvements are significant, as they improve on top of already expert-tuned configurations. Specifically, the auto-tune results on JBMC mean that we can improve upon the first-place winner of SV-COMP'19, which can already produce correct results on 90% of the programs. For CBMC, however, there was no auto-tune run with a score improvement, due to the large penalty for incorrect results.

We also observe that auto-tune increases the number of correct results in all runs (with the exception of the greedy-regression setting for Symbiotic). This, however, does not translate into an improved competition score, as auto-tune generally pays the large penalty for the incorrect results.

Last, all auto-tune settings that do not improve the competition score have lower precision compared to their performance in the first use-case scenario (RQ1), including S1 and S2. This result suggests that the tasks that the comp-default configurations could not complete are harder to analyze, and the verification tools are less likely to produce correct results for them.

5.5.4 RQ3: How do different neighbor generation strategies affect auto-tune's performance?

We use Figures 5.4 and 5.5 to investigate how the neighbor generation strategies affect auto-tune's performance. Overall, the conservative strategy leads to more precise analysis runs with fewer conclusive results, while the base strategy leads to a higher number of conclusive results at the expense of lowered precision. We now present observations about each of the individual strategies.

Base: Although the numbers of correct results are consistently high using the base strategy, the precision is dependent on the tools. This is mostly due to the nature of the configuration options that the tools have; i.e., some tools' configurations are more likely to lead to incorrect results than others (Figure 5.3). For CBMC, JayHorn, and JBMC, all base runs had incorrect results, causing no improvement. For Symbiotic, however, there were so few incorrect results that the score improvement stayed positive.

Conservative: We observe that the runs using the conservative strategy achieve high precision but produce fewer conclusive results compared to the base strategy. All conservative runs achieve 100% precision for Symbiotic (regression only) and an average of 94% precision for the other tools.
For JBMC, the conservative strategy led to fewer conclusive results when combined with the classification approach (discussed in Section 5.5.5).

Greedy: In greedy runs, we observe that the behavior varies. The results are similar to base for CBMC and JayHorn, while they are similar to conservative for JBMC. This is attributable to the options we decide to forbid (or allow) using the screening study findings. CBMC and JayHorn each have two option values forbidden with the greedy strategy; therefore, the results are closer to base, while for JBMC, the forbidden option value set of greedy is similar to that of conservative.

To further investigate how these strategies affect our search algorithm, Table 5.4 shows the median number (and SIQR, in parentheses) of configurations generated (c'), accepted (c), determined to be the best so far (c*), and used to run the analysis tool (line 19 of Algorithm 2), only for the conclusive regression runs, to better isolate the effect of the strategies. When auto-tune cannot find a configuration that leads to a conclusive result, it generates 115,124 configurations (always the same), accepts 57,537 of them, but none of them gets used to run the analysis tool (median). As a general trend, we observe that the search completes very quickly. The median number of configurations generated across all search runs is 16. The overall acceptance rate is 88%, and there is only one analysis run (last column) per auto-tune run. These results suggest that 1) all neighbor generation strategies could generate a new configuration that is potentially superior to the current one, and 2) auto-tune can quickly locate a configuration that leads to a conclusive analysis result. Last, we observe no trends in the number of configurations generated, accepted, and run with each neighbor generation strategy that apply to all tools. Therefore, we conclude that the effect of the neighbor generation strategy depends on the behavior of the configuration options the analysis tools have.

Tool        Neighbor strategy   Generated    Accepted    Best     Run
CBMC        base                25 (28)      21 (21)     3 (2)    1 (0)
            greedy              28 (28)      23 (22)     3 (2)    1 (0)
            conservative        18 (25)      15 (20)     2 (1)    1 (0)
Symbiotic   base                13 (12)      12 (10)     1 (0)    1 (0)
            greedy              9 (3)        9 (3)       1 (0)    1 (0)
            conservative        13 (9)       13 (9)      1 (0)    1 (0)
JayHorn     base                8 (10)       7 (8)       1 (0)    1 (0)
            greedy              7 (7)        6 (5)       1 (0)    1 (0)
            conservative        14 (18)      12 (15)     1 (0)    1 (0)
JBMC        base                90 (109)     79 (95)     1 (0)    1 (0)
            greedy              100 (152)    88 (132)    1 (0)    1 (0)
            conservative        72 (83)      64 (70)     1 (0)    1 (0)

Table 5.4: The number of configurations generated (c'), accepted (c), that improved the best so far (c*), and used for running the tool, as median (SIQR in parentheses).

            Classification inaccuracy (%)      Regression mean abs. error
Tool        fitness          filter            fitness          filter
CBMC        11.95 (0.89)     18.11 (0.17)      11.62 (0.48)     18.89 (0.45)
Symbiotic   23.39 (2.42)     21.09 (4.60)      12.70 (1.23)     5.24 (0.82)
JayHorn     24.15 (1.48)     32.47 (1.35)      13.19 (0.40)     17.04 (1.64)
JBMC        18.51 (1.00)     32.97 (2.74)      13.95 (0.40)     26.00 (1.11)

Table 5.5: Training performance, as median (SIQR in parentheses).

5.5.5 RQ4: How do different machine learning techniques (i.e., classification and regression) affect auto-tune's performance?

Now, we discuss how the two different machine learning techniques compare, using Figures 5.4 and 5.5. We observed that, overall, classification runs led to more conclusive results but with less precision compared to the regression runs, while many of the improvements we discussed in RQ2 are achieved with regression.
To better understand these results, Table 5.5 shows the inaccuracy of classification and the mean absolute error of regression training, using the median and SIQR values (in parentheses) for 25 training runs (5-fold cross-validation × 5 random seeds). First, we see that for the CBMC, JayHorn, and JBMC runs, the fitness function training has better performance than the filter training. Since the fitness function model M is trained on the entire ground-truth dataset, while the filter F is trained only on the conclusive part of the dataset, these results suggest that the extra data available in the fitness function training have additional learnable patterns compared to the conclusive subset for these tools. For Symbiotic, however, the trend is reversed. The filter training has better performance than the fitness function training, suggesting that the fitness function model was unable to learn any additional patterns from the inconclusive part of the dataset.

Last, although we cannot precisely compare the training performance of regression and classification, we observe that for three tools, Symbiotic, JayHorn, and JBMC, the regression error is significantly lower than the inaccuracy of classification.

5.5.6 Threats to Validity

There are several threats to the validity of our study. First, the benchmarks may not be representative of naturally occurring programs. They, however, provide good coverage of program verification scenarios by exercising many intricate code patterns. Also, they have been used in the annual SV-COMP for many years to assess the performance of 30+ program verification tools.

Another threat concerns the size of the datasets. Our ground-truth datasets consist of 55K to 295K data points, which may not be large enough to cover many potential interactions between analysis options and program features. We plan to address this threat by incorporating more benchmark programs into our dataset in future work.

Next, we applied auto-tune to only four program verification tools for two programming languages. In the future work chapter, we will discuss some potential directions for creating more ground-truth datasets covering more programming languages. The tools we studied, however, are good representatives of the state of the art in the research field, implementing fundamental program analysis and verification techniques. CBMC and JBMC implement the bounded model checking technique for C and Java programs, respectively. JayHorn also implements a model checking technique, using Horn clauses, for Java programs. Symbiotic implements symbolic execution.11

11 See Appendix A for more details on the program analysis techniques.

5.6 Attributions and Acknowledgments

The work presented in this chapter was done with the collaborators Austin Mordahl, a Ph.D. student at The University of Texas at Dallas (UTD), Dr. Shiyi Wei, and the academic advisors of Ugur Koc. I, Ugur Koc, designed and developed the auto-tune approach. I also designed the experimental setup and evaluation framework with the guidance of the other collaborators.

Chapter 6: Related Work

In this chapter, we discuss work related to the approaches we described in Chapters 3, 4, and 5. First, we discuss research that uses machine learning to classify false positive static analysis reports. Then, we discuss the broader application of machine learning techniques, specifically natural language processing (NLP), to source code to solve a variety of software engineering problems.
6.1 Related Work for Automatic False Positive Classification

Z-ranking by Kremenek and Engler [7] is one of the earliest approaches addressing false positive static analysis reports. The z-ranking technique ranks analysis reports using the frequency counts of successful and failed checks. The observations underlying the design of the approach are that bugs are rare, and that bad analysis decisions can lead to explosions of spurious false positive bug reports. Therefore, false positive bug reports are likely to refer to more failed checks than successful ones. Although this work provides interesting insights, it requires altering a static analysis tool to gather the successful and failed checks. Furthermore, successful and failed checks are not well defined for all static analysis algorithms and problems.

More recent work aims to filter false positive analysis reports using machine learning based on hand-engineered features [45, 46, 9, 10]. Yüksel and Sözer used ten features, extracted from bug reports and analyzed programs, to classify static analysis alerts for C++ programs [9]. They experiment with 34 different machine learning algorithms and report 86% accuracy in classification. Tripp et al. similarly identify 14 features for false positive XSS reports generated for JavaScript programs. They report significant improvements in the precision of the static analysis when using the classifiers as a post-analysis filter [10]. In our empirical assessment (Chapter 4), we evaluated this approach by adopting these 14 features for Java programs, attempting to hew closely to the type of features used in the original work. Our approach differs from these works in that we do not manually extract features. Instead, we use sophisticated machine learning techniques that can automatically capture important signals in programs.

In a more recent study, Raghothaman et al. introduce a user-guided Bayesian inference approach to filter false positive reports from deductive rule-based static analysis algorithms [114]. We did not include this work in our evaluation because it works on a per-program basis, requires user input, and is designed for certain static analysis techniques.

To the best of our knowledge, there are no prior empirical studies of machine learning approaches for false positive static analysis report classification. Furthermore, none of the existing work in this line of research has studied our second application scenario, which focuses on demonstrating the generalizability of the machine learning models. Next, we discuss the broader application of natural language processing techniques to software engineering problems.

6.2 Natural Language Processing (NLP) Techniques Applied to Code

NLP techniques have achieved a great level of success on tasks involving everyday language because such language is repetitive and predictable. Hindle et al. [115] used NLP techniques on source code, arguing that source code written by human developers is similarly repetitive and predictable. Therefore, NLP techniques can potentially be very successful at learning properties from code. Expanding on this work, multiple researchers have successfully applied NLP techniques to programs to tackle a wide range of software engineering problems such as clone detection [116], API mining [117, 118], variable naming and renaming [119, 120], code suggestion and completion [121, 122, 123], and bug detection [75]. Nguyen et al.
[122] introduce a new statistical semantic language model for source code to handle development tasks like code suggestion and completion. Tu et al. [121] add a cache component to the n-gram model to exploit local properties in code. Raychev et al. [123] use n-gram and neural network language models for code completion for API calls. Allamanis et al. [120] propose a new neural network-based model with word embeddings to suggest meaningful method and class names for Java code. In another related work, Allamanis et al. [124] use a deep convolutional neural network for source code summarization (a short natural language summary of a method or function). Fowkes et al. [118] learn API usage patterns using probabilistic modeling. Gu et al. [125] solve the same problem using a deep recurrent neural network. White et al. [116] use recurrent neural networks (as in our work) to detect code clones.

Above, we listed some sample work that demonstrates the successful application of NLP techniques to source code processing for solving software engineering problems. This list, however, is not meant to be comprehensive. For more comprehensive coverage of the literature, we refer readers to the recent survey of such research efforts by Allamanis et al. [126]. Note that none of the work mentioned above solves the false error report classification problem.

6.3 Selection and Ranking of Static Analyses

Next, we discuss the work related to the auto-tune approach. To the best of our knowledge, ours is the first work to use machine learning to select a good configuration of a tool that is likely to lead to the desired analysis result for a given verification task. However, we find two lines of research closely related to our approach: (i) selection and ranking of static analyses; and (ii) adaptive static analysis techniques.

The work most relevant to ours comprises the papers that select strategies within a static analysis tool [100, 127], and those that rank static analysis tools based on their likelihood of producing the desired analysis result for a given task [30, 31, 128]. Beyer and Dangl presented a feature-based approach to select from three manually-defined verification strategies for CPAchecker [129, 100]. A strategy is a sequence of verifiers defined within CPAchecker. Their approach uses four boolean program features to define the selection heuristic. Richter and Wehrheim presented PeSCo, an approach that uses a machine learning model to predict the best ordering of five CPAchecker strategies [127]. Although these two studies address a related problem, our approach differs from them in two aspects. First, our meta-heuristic search algorithm leverages machine learning models that are applicable to different static analysis tools. Second, the large configuration space of static analysis tools presents a challenge that drives us to design an efficient search-based algorithm.

Tulsian et al. presented Mux to select the optimal verification tool to run on Windows device drivers [30]. Their learning-based approach trains Support Vector Machine (SVM) and regression models on a verification repository to generate an algorithm selector. Mux then extracts features from the device driver to be verified and uses the algorithm selector to predict the optimal tool to run on it. The focus of the prediction is running time.
In our work, we focus on the large configuration space of static analysis tools and predict the configuration that is likely to lead to conclusive and correct results.

Czech et al. presented a method to predict the rankings of SV-COMP tools using a machine learning approach [31]. They use kernel methods to predict rankings based on a graph representation of verification tasks. Similarly, Demyanova et al. presented a feature-based machine learning method to construct a portfolio solver [128]. The portfolio solver uses heuristic preprocessing to select one of the existing tools; the authors demonstrate that their approach would hypothetically have won both SV-COMP 2014 and 2015. The above approaches focus on selecting from a list of available tools, while our research complements these works by considering the configurability of each static analysis tool.

6.4 Adaptive Sensitivity Static Analysis

Several recent research studies have focused on predicting the sensitivity of an analysis algorithm so that it can be applied selectively to the target program. Context-sensitivity is the most studied analysis option [21, 22, 24, 25]. Very simply, context-sensitivity concerns how to analyze each method in a program. Analyzing a method only once, independent of the context it is called with, is context-insensitive but fast, while analyzing a method separately for each context it is called with is context-sensitive but very expensive.

Jeong et al. presented an approach to automatically determine to which parts of a program to apply context-sensitivity [24]. They hand-engineered low-level, easy-to-obtain features of Java methods and statements. They then developed a learning model that synthesized high-level features by combining low-level features with boolean formulae, and then returned a parameterized heuristic function that indicates the level of context-sensitivity necessary for each method. Similarly, Li et al. presented Scaler, a scalable pointer analysis framework with self-tuning context-sensitivity [22]. Scaler allows the user to specify a maximum size for the points-to information generated, and then performs the most precise analysis within that memory bound. It achieves this by running an inexpensive pre-analysis that extracts information sufficient to estimate how expensive various context-sensitivities would be on different program methods. Scaler then selects a context-sensitivity level that yields a precise analysis while still falling within the specified memory bound. Li et al. also presented a method to automatically introduce context-sensitivity, as needed, on parts of programs into an otherwise context-insensitive analysis [25]. The authors manually identified three value-flow patterns, then constructed Zipper, which recognizes these patterns in Java code and guides analysis by identifying methods that would benefit from context-sensitivity.

Wei and Ryder presented an adaptive context-sensitive analysis for JavaScript [21]. They extracted eight features of JavaScript functions from the points-to graph and call graph and then developed learning-based heuristics that inform how those eight features should affect the choice of context-sensitivity. Their adaptive analysis then uses these heuristics and the extracted features to choose a context-sensitive analysis for each function. Other analysis options, such as flow-sensitivity, have also been used to develop selective static analyses (e.g., [23]).
Our work similarly studies the relationship between program features and analysis algorithms to achieve a good balance between performance, precision, and soundness. However, we consider a wide range of configuration options, while all the work mentioned above focuses on certain analysis algorithms and configuration options. In addition, instead of developing a selective analysis, our approach aims to help the users of existing static analysis tools by automatically configuring those tools.

Another research area combining machine learning and program analysis is using machine learning to learn a static analysis. Oh, Yang, and Yi presented a strategy for using Bayesian optimization to learn the parts of a program to which one should apply precision-improving techniques when performing static analysis [23]. Their approach extracts features from variables in the program to determine which of those variables would benefit the most from higher-precision analyses, and then applies those analyses to some number of those variables. Smaragdakis et al. observed that unsoundness is necessary in order to scale static analyses [20]. Heo, Oh, and Yi observed that this unsoundness is uniform in modern bug-detecting static analyses, causing a high number of false alarms. They presented a method that uses machine learning to selectively make a static analysis unsound over certain loops in a program [130]. They trained classifiers to identify harmless loops (i.e., loops that, when analyzed unsoundly, reduce false positives and introduce no false negatives) in the programs under analysis. The classifier then identifies harmless loops in programs, which the analyzer unrolls once and replaces with an if-statement. Instead of modifying an existing static analysis algorithm or designing new analysis algorithms, our approaches focus on improving the usability of existing static analysis tools.

Chapter 7: Conclusion

In this dissertation, we presented machine learning-based approaches to address two critical usability issues that static analysis tools face: false positive results and proper tool configuration.

False positive reports are one of the major reasons developers give for not using static analysis tools in their software development practice [1]. To address this issue, we presented a learning approach to find program structures that cause state-of-the-art static analysis tools to emit false error reports, and to filter out such false error reports with a classifier trained on the code (Chapter 3) [71]. In particular, we designed and developed two machine learning models, a Bayesian inference-based model and an LSTM model, and two code reduction techniques, method body and backward slice. To evaluate the effectiveness of the approach, we conducted a case study of a widely-used static analysis tool for Java web security checks, FindSecBugs.

In the case study, we discovered interesting signals involving Java collection objects with the Naive Bayes model. Investigating these signals, we found that FindSecBugs cannot successfully reason about very simple usage scenarios of Java collection objects like HashMap and ArrayList. FindSecBugs likely considers the entire data collection as tainted if any tainted objects are put into the collection. This over-approximation results in many false positive reports, which can be avoided if static analysis developers improve their analysis of collection classes in the future.
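The following hypothetical snippet, invented here for illustration and not taken from our benchmarks, shows the kind of pattern we mean. Only the hard-coded "limit" entry flows into the query, yet a tool that taints the whole collection once any tainted value is added would likely still report the call as an SQL injection.

import java.sql.Connection;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.Map;
import javax.servlet.http.HttpServletRequest;

public class CollectionOverApproximation {
    public void query(HttpServletRequest request, Connection con) throws SQLException {
        Map<String, String> params = new HashMap<>();
        params.put("offset", request.getParameter("offset")); // tainted value enters the map
        params.put("limit", "20");                             // hard-coded, untainted value
        String limit = params.get("limit");                    // actually untainted
        String sql = "SELECT id, name FROM products LIMIT " + limit;
        con.prepareCall(sql).execute(); // likely still reported if the whole map is treated as tainted
    }
}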
With the LSTM model, we achieved 89.6% and 85% accuracy in classification for the method body and the backward slice datasets, respectively. In fact, using the LSTM model trained on the backward slice dataset as a false positive result filter, we removed 81% of the false positives from the developers' view, while only mistakenly removing 2.7% of real bug reports, which improved the tool's overall precision from 49.6% to 90.5%. Furthermore, we analyzed the output the LSTM produced for the tokens of two input programs. With this analysis, we (i) showed that long-term dependencies exist in the data, (ii) demonstrated LSTMs' capability of inferring the context in which the tokens appear, and (iii) showed how the LSTM's output values agree with the findings of the Naive Bayes model.

We believed the results of this case study suggested that the approach was promising. Therefore, we extended the approach and the empirical evaluations in several ways (Chapter 4) [18]. We presented the first empirical study that evaluates four families of machine learning approaches, i.e., HEF, BoW, LSTM, and GGNN, for classifying static analysis reports to filter false positives from true positives. Moreover, we introduced new code transformation routines for preparing programs as inputs to these machine learning approaches.

We used three datasets in our experiments. The first dataset is built from the OWASP benchmark using only the SQL injection vulnerabilities (the same dataset we used in the case study in Section 3.3). The other two datasets are built from the real-world benchmark programs we collected for this study. To create these datasets, we manually reviewed 400 analysis reports to label them as true or false positives. In this manual review process, we made two critical observations about the scenarios in which false positives occur. These observations suggest that it is crucial to create more real-world benchmarks to assess static analysis tools and machine learning-based approaches like ours.

We compared 13 machine learning approaches from four families using the OWASP and real-world datasets under two application scenarios. The results of our experiments suggest that the LSTM approach generally achieves better accuracy. We also observed that, across all approaches, the second application scenario, in which training is done with one set of programs and the models are tested on other programs, is more challenging: it requires learning the symptoms of true and false positive reports that hold across programs. All of the learning approaches performed significantly worse in this scenario compared to the first application scenario, in which training and test samples come from the same programs. Particularly in the second scenario, we observed that more detailed data preparation with abstraction and word extraction leads to significant increases in accuracy, while not causing any significant drops in the first scenario. LSTM, the best-performing model, achieved slightly above 80% accuracy in classification with very precise data preparation in this scenario. We also showed that there can be higher variance in recall and precision than in accuracy. We conjecture this is because recall and precision are not directly related to the loss function being optimized in training.
Overall, the results of our empirical evaluations suggest that we can effectively distinguish and filter out a majority of false positive bug reports without drastically affecting the number of true bug reports.

Next, we addressed the configurability challenge of static analysis tools (Chapter 5). We presented auto-tune, a meta-heuristic search approach to automatically configure static analysis tools. This approach is novel in that it uses machine learning models both as fitness functions to explore large configuration spaces and as filters to avoid selecting expensive analysis runs that will likely produce false results. Note that these filters are very much like the false positive classification models we learned in Chapters 3 and 4. There are two key differences, however: (i) in addition to the program representations, they also learn about the tool configurations; and (ii) instead of just a classification result, they output a continuous numeric score.

We applied auto-tune to four popular verification tools: CBMC and Symbiotic for C programs, and JayHorn and JBMC for Java programs. We evaluated its performance relative to how these tools did in the software verification competition SV-COMP, under two scenarios. In the first scenario, starting the search with a random configuration, we examined how close auto-tune's performance can get to expert knowledge, taking the competition configurations of these subject tools, i.e., comp-default, as the reference point for expert knowledge. We found that auto-tune was able to select configurations comparable to comp-default. For three out of the four subjects, auto-tune could increase the number of correct analysis results with a reasonably small loss in precision. The remaining subject, JBMC, had already completed 94% of the tasks correctly, which is a very high percentage.

In the second scenario, we ran the search only if the comp-default configuration failed to produce a conclusive analysis result. In this case, auto-tune could increase the number of correct analysis results for all subjects. For three subjects, these improvements also translated into SV-COMP score improvements in some settings of auto-tune. For one subject, CBMC, however, there was no auto-tune run with an SV-COMP score improvement, due to the substantial penalty for the few false results auto-tune generated.

We further investigated the impact of two design choices of the auto-tune approach: the neighbor generation strategy and the machine learning model (for varying threshold values). For neighbor generation, we observe that the conservative strategy leads to more precise analysis runs with fewer conclusive results, while the base strategy leads to a higher number of conclusive results at the expense of drops in precision. For the machine learning models, the classification runs led to more complete results but with less precision compared to the regression runs, while many of the improvements we discussed in RQ2 are achieved with regression. Overall, the results of the empirical evaluations suggest that the auto-tune approach can improve the performance of program verification tools by automatically configuring them for given verification tasks.

We believe that the approaches we proposed, developed, and empirically evaluated will help software developers incorporate static analysis tools into their development practice, i.e., improve the adoption of these verification tools.
Furthermore, the insights we gathered through dataset creation and empirical evaluations will also help static analysis developers enhance the performance of their tools.

Chapter 8: Future Work

In this chapter, we discuss potential directions for future work to extend the approaches we presented. We start with false positive detection.

8.1 Future Work for False Positive Detection

The major limitation for extending the research in this direction is the availability of ground-truth datasets. The research community needs, but currently lacks, an extensive repository of real-world datasets to enable new research into machine learning-based approaches to program analysis and understanding. We created an example dataset of this kind and used it in our systematic assessment study [18]. The results of this study, and further observations we made, underlined the significance of having real-world examples. However, this dataset was a somewhat ad-hoc attempt, and no individual effort will be sufficient to create such a repository of datasets at scale. The research community needs to pay attention to this shortcoming.

An exciting research direction for creating such datasets is to study crowdsourcing techniques. Crowdsourcing has already been well studied and used for many software engineering problems [131]. There are vast online communities formed around crowdsourcing platforms like Topcoder (https://www.topcoder.com) and bugcrowd (https://www.bugcrowd.com). Similar platforms can be promising for creating these datasets.

Once the datasets are created, further machine learning studies can be done. For instance, we believe recursive neural networks [132] can be useful for learning on the abstract syntax tree (AST) representation of code. Also, although we experimented with GNNs, their complexity might require learning with larger datasets to tune the hyper-parameters effectively. Therefore, further investigation of GNNs with more data is necessary. Moreover, as we discussed in Chapter 2, different machine learning algorithms have different biases, and therefore their performance can depend on the nature of the data. Another potential direction is to explore these biases and the nature of the data in order to combine different machine learning approaches with a voting scheme. Last, the approaches we presented can be extended to become a semi-supervised, incremental, online service that static analysis developers and users can use to improve the tools' quality, performance, and thereby practicality.

8.2 Future Work for auto-tune

The availability of real-world datasets is a limitation for this work as well. Beyond the crowdsourcing direction, program analysis competitions like SV-COMP [101] present an excellent opportunity to extend the datasets to cover more kinds of software bugs and programming languages.

With the availability of more datasets, we see two potential directions to extend the auto-tune approach. The first direction involves designing and developing more sophisticated machine learning architectures that can learn more structural information from programs' code, along with the configuration options of the analysis tools. The second direction involves integrating more search algorithms into auto-tune, such as A* search, genetic algorithms, and tabu search [133], to see whether the configuration space can be explored more efficiently and effectively.
To adopt simulated annealing, we incorporated machine learning models both as fitness functions and as false result filters. Other search algorithms might, however, require designing other machine learning-based components.

Appendix A: Program Analysis Techniques

This appendix provides brief background information on the program analysis techniques implemented or used by the static analysis tools we studied in Chapters 3, 4, and 5.

A.1 Taint Analysis

 1 public class MyServlet extends HttpServlet {
 2   public void doPost(HttpServletRequest request) {
 3     String param = "";
 4     Enumeration headers = request.getHeaders("OFFSET");
 5     if (headers.hasMoreElements()) {
 6       param = headers.nextElement(); // tainted data
 7     }
 8     String sql = "SELECT id, name FROM products ORDER BY name LIMIT 20 OFFSET " + param;
 9     Connection con = DatabaseHelper.getConnection();
10     con.prepareCall(sql).execute(); // security-critical operation on database
11   } // end doPost
12 } /* end class */

Figure A.1: Example code vulnerable to SQL injection.

Taint analysis can be seen as a form of information flow analysis. Information flow can happen with any operation (or series of operations) that uses the value of one object to derive a value for another. In this flow, if the source of the flow is untrustworthy, we call the value coming from that source "tainted". Taint analysis aims at tracking the propagation of such tainted values in a program.

A highly popular application of this technique is to perform security checks against injection attacks. With taint analysis, static analysis tools can check whether tainted values can reach security-critical operations in a program. For example, running an SQL query on the database is usually a security-critical operation, and data received from untrustworthy sources should not be used to create SQL statements without proper sanitization. Consider the short program in Figure A.1. This program is vulnerable to SQL injection attacks: the tainted value received from an HTTP request object is used to create an SQL statement that gets executed on the database (at lines 6, 8, and 10, respectively). FindSecBugs, the subject static analysis tool in Chapters 3 and 4, can effectively find such security vulnerabilities with taint analysis.

A.2 Model-checking

Model-checking [134] is a program analysis technique (more specifically, a formal method) that uses finite-state models (FSMs) of software programs to check correctness properties for given specifications written in propositional temporal logic [135]. Model-checking tools exhaustively search the state space to find paths from start states to invalid states. If such paths exist, the tool can also provide a counter-example that shows how to reach the invalid state.

However, model-checking tools face a combinatorial blow-up of the state space as programs get bigger. One common approach to address this issue is to bound the number of steps taken on the FSM. This approach is called bounded model-checking (BMC). CBMC [99] and JBMC [36] implement this approach to verify (or to find bugs in) C and Java programs, respectively. JayHorn also implements model-checking for specification violations in Java programs. JayHorn generates constrained Horn clauses (CHC) [136] as verification conditions and passes them to a Horn engine that checks their satisfiability. CHCs are rule-like logic formulae specifying the pre-conditions on the parameters of methods, the post-conditions on the return values of methods, and the list of variables in scope at each program location.

A.3 Symbolic Execution

Symbolic execution is a technique for analyzing a program to determine what inputs cause each part of the program to execute. Symbiotic implements the symbolic execution technique to verify (or to find bugs in) C programs. Symbiotic interprets a given program by assuming symbolic values for inputs rather than obtaining concrete inputs as normal program execution would (like running a test case). For the expressions and variables in the program, Symbiotic computes expressions of the inputs in terms of those symbols. These expressions are called path conditions, and they denote the possible outcomes of each conditional branch in the program.
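As a concrete illustration of path conditions, consider the small method below. It is a hypothetical example written in Java for consistency with the rest of this document (Symbiotic itself analyzes C programs); a symbolic executor would treat x as a symbolic value and explore both branches, recording the conditions noted in the comments.

public class PathConditionExample {
    // x is treated as a symbolic value rather than a concrete input.
    static int classify(int x) {
        if (x > 10) {
            return x - 10;   // explored under path condition: x > 10
        }
        return 0;            // explored under path condition: x <= 10
    }
}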
Bibliography

[1] Brittany Johnson, Yoonki Song, Emerson Murphy-Hill, and Robert Bowdidge. Why Don't Software Developers Use Static Analysis Tools to Find Bugs? In Proceedings of the 2013 International Conference on Software Engineering, ICSE '13, pages 672–681, Piscataway, NJ, USA, 2013. IEEE Press.
[2] Guy Erez, Eran Yahav, and Mooly Sagiv. Generating concrete counterexamples for sound abstract interpretation. Citeseer, 2004.
[3] Xavier Rival. Understanding the Origin of Alarms in Astrée. In Static Analysis, Lecture Notes in Computer Science, pages 303–319. Springer, Berlin, Heidelberg, September 2005.
[4] Bhargav S. Gulavani and Sriram K. Rajamani. Counterexample Driven Refinement for Abstract Interpretation. In Tools and Algorithms for the Construction and Analysis of Systems, Lecture Notes in Computer Science, pages 474–488. Springer, Berlin, Heidelberg, March 2006.
[5] Youil Kim, Jooyong Lee, Hwansoo Han, and Kwang-Moo Choe. Filtering false alarms of buffer overflow analysis using SMT solvers. Information and Software Technology, 52(2):210–219, February 2010.
[6] Benjamin Livshits, Manu Sridharan, Yannis Smaragdakis, Ondřej Lhoták, J. Nelson Amaral, Bor-Yuh Evan Chang, Samuel Z. Guyer, Uday P. Khedker, Anders Møller, and Dimitrios Vardoulakis. In Defense of Soundiness: A Manifesto. Communications of the ACM, 58(2):44–46, 2015.
[7] Ted Kremenek and Dawson Engler. Z-ranking: Using Statistical Analysis to Counter the Impact of Static Analysis Approximations. In Proceedings of the 10th International Conference on Static Analysis, SAS'03, pages 295–315, Berlin, Heidelberg, 2003. Springer-Verlag.
[8] Yungbum Jung, Jaehwang Kim, Jaeho Shin, and Kwangkeun Yi. Taming False Alarms from a Domain-Unaware C Analyzer by a Bayesian Statistical Post Analysis. In Static Analysis, Lecture Notes in Computer Science, pages 203–217. Springer, Berlin, Heidelberg, September 2005.
[9] U. Yüksel and H. Sözer. Automated Classification of Static Code Analysis Alerts: A Case Study. In 2013 IEEE International Conference on Software Maintenance, pages 532–535, September 2013.
[10] Omer Tripp, Salvatore Guarnieri, Marco Pistoia, and Aleksandr Aravkin. ALETHEIA: Improving the Usability of Static Security Analysis. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, CCS '14, pages 762–774, New York, NY, USA, 2014. ACM.
[11] Ugur Koc, Parsa Saadatpanah, Jeffrey S. Foster, and Adam A. Porter. Learning a Classifier for False Positive Error Reports Emitted by Static Code Analysis Tools.
In Proceedings of the 1st ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, MAPL 2017, pages 35–42, New York, NY, USA, 2017. ACM.
[12] Philippe Arteau. Find Security Bugs, version 1.4.6, 2015. http://find-sec-bugs.github.io, Accessed on 2018-01-04.
[13] The OWASP Foundation. The OWASP Benchmark for Security Automation, version 1.1, 2014. https://www.owasp.org/index.php/Benchmark, Accessed on 2018-01-04.
[14] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735, November 1997.
[15] Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to Forget: Continual Prediction with LSTM. Neural Computation, 12(10):2451–2471, October 2000.
[16] Alex Graves. Supervised Sequence Labelling. In Supervised Sequence Labelling with Recurrent Neural Networks, Studies in Computational Intelligence, pages 5–13. Springer, Berlin, Heidelberg, 2012.
[17] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: A CPU and GPU math compiler in Python. In Proc. 9th Python in Science Conf., volume 1, pages 3–10, 2010.
[18] Ugur Koc, Shiyi Wei, Jeffrey S. Foster, Marine Carpuat, and Adam A. Porter. An empirical assessment of machine learning approaches for triaging reports of a Java static analysis tool. In 2019 12th IEEE Conference on Software Testing, Validation and Verification (ICST), pages 288–299. IEEE, 2019.
[19] Shiyi Wei, Piotr Mardziel, Andrew Ruef, Jeffrey S. Foster, and Michael Hicks. Evaluating design tradeoffs in numeric static analysis for Java. In Programming Languages and Systems - 27th European Symposium on Programming, ESOP 2018, Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2018, Thessaloniki, Greece, April 14-20, 2018, Proceedings, pages 653–682, 2018.
[20] Yannis Smaragdakis, George Kastrinis, and George Balatsouras. Introspective analysis: Context-sensitivity, across the board. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '14, pages 485–495, New York, NY, USA, 2014. ACM.
[21] Shiyi Wei and Barbara G. Ryder. Adaptive context-sensitive analysis for JavaScript. In 29th European Conference on Object-Oriented Programming, ECOOP 2015, July 5-10, 2015, Prague, Czech Republic, pages 712–734, 2015.
[22] Yue Li, Tian Tan, Anders Møller, and Yannis Smaragdakis. Scalability-first pointer analysis with self-tuning context-sensitivity. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2018, pages 129–140, New York, NY, USA, 2018. ACM.
[23] Hakjoo Oh, Hongseok Yang, and Kwangkeun Yi. Learning a strategy for adapting a program analysis via Bayesian optimisation. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2015, pages 572–588, New York, NY, USA, 2015. ACM.
[24] Sehun Jeong, Minseok Jeon, Sungdeok Cha, and Hakjoo Oh. Data-driven context-sensitivity for points-to analysis. Proc. ACM Program. Lang., 1(OOPSLA):100:1–100:28, October 2017.
[25] Yue Li, Tian Tan, Anders Møller, and Yannis Smaragdakis. Precision-guided context sensitivity for pointer analysis. Proc. ACM Program. Lang., 2(OOPSLA):141:1–141:29, October 2018.
[26] Ugur Koc, Shiyi Wei, Jeffrey S. Foster, Marine Carpuat, and Adam A. Porter.
An empirical assessment of machine learning approaches for triaging reports of a Java static analysis tool. In 12th IEEE Conference on Software Testing, Validation and Verification, ICST 2019, Xi'an, China, April 22-27, 2019, pages 288–299, 2019.
[27] Enas A. Alikhashashneh, Rajeev R. Raje, and James H. Hill. Using machine learning techniques to classify and predict static code analysis tool warnings. In 2018 IEEE/ACS 15th International Conference on Computer Systems and Applications (AICCSA), pages 1–8. IEEE, 2018.
[28] Sunghun Kim and Michael D. Ernst. Which warnings should I fix first? In Proceedings of the 6th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering, pages 45–54. ACM, 2007.
[29] Sarah Heckman and Laurie Williams. On establishing a benchmark for evaluating static analysis alert prioritization and classification techniques. In Proceedings of the Second ACM-IEEE international symposium on Empirical software engineering and measurement, pages 41–50. ACM, 2008.
[30] Varun Tulsian, Aditya Kanade, Rahul Kumar, Akash Lal, and Aditya V. Nori. Mux: algorithm selection for software model checkers. In Proceedings of the 11th Working Conference on Mining Software Repositories, pages 132–141. ACM, 2014.
[31] Mike Czech, Eyke Hüllermeier, Marie-Christine Jakobs, and Heike Wehrheim. Predicting Rankings of Software Verification Tools. In Proceedings of the 3rd ACM SIGSOFT International Workshop on Software Analytics, SWAN 2017, pages 23–26, New York, NY, USA, 2017. ACM.
[32] Rakefet Ackerman and Valerie A. Thompson. Meta-reasoning. Reasoning as memory, pages 164–182, 2015.
[33] Stefania Costantini. Meta-reasoning: a survey. In Computational Logic: Logic Programming and Beyond, pages 253–288. Springer, 2002.
[34] Jiří Slabý, Jan Strejček, and Marek Trtík. Checking properties described by state machines: On synergy of instrumentation, slicing, and symbolic execution. In International Workshop on Formal Methods for Industrial Critical Systems, pages 207–221. Springer, 2012.
[35] Jiří Slabý, Jan Strejček, and Marek Trtík. Symbiotic: synergy of instrumentation, slicing, and symbolic execution. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems, pages 630–632. Springer, 2013.
[36] Lucas Cordeiro, Pascal Kesseli, Daniel Kroening, Peter Schrammel, and Marek Trtík. JBMC: A bounded model checking tool for verifying Java bytecode. In Computer Aided Verification (CAV), volume 10981 of LNCS, pages 183–190. Springer, 2018.
[37] Temesghen Kahsai, Philipp Rümmer, Huascar Sanchez, and Martin Schäf. JayHorn: A framework for verifying Java programs. In International Conference on Computer Aided Verification, pages 352–358. Springer, 2016.
[38] Mohamad Kassab, Joanna F. DeFranco, and Phillip A. Laplante. Software testing: The state of the practice. IEEE Software, 34(5):46–52, 2017.
[39] Kshirasagar Naik and Priyadarshi Tripathy. Software testing and quality assurance: theory and practice. John Wiley & Sons, 2011.
[40] Srinivasan Desikan and Gopalaswamy Ramesh. Software testing: principles and practice. Pearson Education India, 2006.
[41] Caitlin Sadowski, Edward Aftandilian, Alex Eagle, Liam Miller-Cushon, and Ciera Jaspan. Lessons from building static analysis tools at Google. Communications of the ACM (CACM), 61(4):58–66, 2018.
[42] Caitlin Sadowski, Jeffrey Van Gogh, Ciera Jaspan, Emma Söderberg, and Collin Winter.
Tricorder: Building a program analysis ecosystem. In Proceedings of the 37th International Conference on Software Engineering - Volume 1, pages 598–608. IEEE Press, 2015.
[43] Junjie Wang, Song Wang, and Qing Wang. Is there a "golden" feature set for static warning identification?: An experimental evaluation. In Proceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM '18, pages 17:1–17:10, New York, NY, USA, 2018. ACM.
[44] Stuart J. Russell and Peter Norvig. Artificial intelligence: a modern approach. Malaysia; Pearson Education Limited, 2016.
[45] S. S. Heckman. Adaptive probabilistic model for ranking code-based static analysis alerts. In Software Engineering - Companion, 2007. ICSE 2007 Companion. 29th International Conference on, pages 89–90, May 2007.
[46] Sarah Smith Heckman. A systematic model building process for predicting actionable static analysis alerts. North Carolina State University, 2009.
[47] Yoav Goldberg. Neural network methods for natural language processing. Synthesis Lectures on Human Language Technologies, 10(1):1–309, 2017.
[48] A. Sureka and P. Jalote. Detecting Duplicate Bug Report Using Character N-Gram-Based Features. In 2010 Asia Pacific Software Engineering Conference, pages 366–374, November 2010.
[49] Stacy K. Lukins, Nicholas A. Kraft, and Letha H. Etzkorn. Bug localization using latent Dirichlet allocation. Information and Software Technology, 52(9):972–990, 2010.
[50] Xin Ye, Hui Shen, Xiao Ma, Razvan Bunescu, and Chang Liu. From Word Embeddings to Document Similarities for Improved Information Retrieval in Software Engineering. In Proceedings of the 38th International Conference on Software Engineering, ICSE '16, pages 404–415, New York, NY, USA, 2016. ACM.
[51] Danilo P. Mandic and Jonathon Chambers. Recurrent neural networks for prediction: learning algorithms, architectures and stability. John Wiley & Sons, Inc., 2001.
[52] Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10):2451–2471, 2000.
[53] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. Understanding the exploding gradient problem. CoRR, abs/1211.5063, 2, 2012.
[54] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association, 2010.
[55] Tomas Mikolov and Geoffrey Zweig. Context dependent recurrent neural network language model. SLT, 12(234-239):8, 2012.
[56] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv:1406.1078, 2014.
[57] Hoa Khanh Dam, Truyen Tran, and Trang Thi Minh Pham. A deep language model for software code. In FSE 2016: Proceedings of the Foundations of Software Engineering International Symposium, pages 1–4, 2016.
[58] Nate Kushman and Regina Barzilay. Using semantic unification to generate regular expressions from natural language. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 826–836, 2013.
[59] Wang Ling, Phil Blunsom, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, Fumin Wang, and Andrew Senior. Latent predictor networks for code generation.
In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 599–609, 2016.
[60] Marco Gori, Gabriele Monfardini, and Franco Scarselli. A new model for learning in graph domains. In Neural Networks, 2005. IJCNN'05. Proceedings. 2005 IEEE International Joint Conference on, volume 2, pages 729–734. IEEE, 2005.
[61] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The Graph Neural Network Model. IEEE Transactions on Neural Networks, 20(1):61–80, January 2009.
[62] Zachary Reynolds, Abhinandan Jayanth, Ugur Koc, Adam Porter, Rajeev Raje, and James Hill. Identifying and documenting false positive patterns generated by static code analysis tools. In 4th International Workshop On Software Engineering Research And Industrial Practice, 2017.
[63] Mark Weiser. Program slicing. In Proceedings of the 5th international conference on Software engineering, pages 439–449. IEEE Press, 1981.
[64] IBM. The T.J. Watson Libraries for Analysis (WALA), 2006. http://wala.sourceforge.net/wiki/index.php, Accessed on 2018-01-04.
[65] Pierre-Luc Carrier and Kyunghyun Cho. LSTM Networks for Sentiment Analysis, 2016. http://deeplearning.net/tutorial/lstm.html, Accessed on 2018-01-04.
[66] Matthew D. Zeiler. ADADELTA: An Adaptive Learning Rate Method. arXiv:1212.5701 [cs], December 2012.
[67] Nathaniel Ayewah and William Pugh. Using Checklists to Review Static Analysis Warnings. In Proceedings of the 2nd International Workshop on Defects in Large Software Systems: Held in Conjunction with the ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2009), DEFECTS '09, pages 11–15, New York, NY, USA, 2009. ACM.
[68] Tim Boland and Paul E. Black. Juliet 1.1 C/C++ and Java test suite. Computer, 45(10):88–90, 2012.
[69] Robert A. Martin. Common weakness enumeration. Mitre Corporation, 2007. https://cwe.mitre.org, Accessed on 2018-01-04.
[70] Andrej Karpathy, Justin Johnson, and Li Fei-Fei. Visualizing and Understanding Recurrent Networks. arXiv:1506.02078, June 2015.
[71] Ugur Koc, Parsa Saadatpanah, Jeffrey S. Foster, and Adam A. Porter. Learning a classifier for false positive error reports emitted by static code analysis tools. In Proceedings of the 1st ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, MAPL 2017, pages 35–42, New York, NY, USA, 2017. ACM.
[72] Joana (Java object-sensitive analysis) - information flow control framework for Java. https://pp.ipd.kit.edu/projects/joana.
[73] B. K. Rosen, M. N. Wegman, and F. K. Zadeck. Global value numbers and redundant computations. In Proceedings of the 15th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL '88, pages 12–27, New York, NY, USA, 1988. ACM.
[74] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated Graph Sequence Neural Networks. arXiv:1511.05493 [cs, stat], November 2015.
[75] Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. Learning to Represent Programs with Graphs. arXiv:1711.00740 [cs], November 2017.
[76] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, L. Sutskever, and G. Zweig. word2vec. https://code.google.com/p/word2vec, 2013.
[77] Yoav Goldberg and Omer Levy. word2vec explained: deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv:1402.3722, 2014.
[78] Elisa Burato, Pietro Ferrara, and Fausto Spoto. Security analysis of the OWASP benchmark with Julia. In Proc.
of ITASEC17, the First Italian Conference on Security, Venice, Italy, 2017.
[79] Achilleas Xypolytos, Haiyun Xu, Barbara Vieira, and Amr M. T. Ali-Eldin. A framework for combining and ranking static analysis tool findings based on tool performance statistics. In Software Quality, Reliability and Security Companion (QRS-C), 2017 IEEE International Conference on, pages 595–596. IEEE, 2017.
[80] Apollo: a distributed configuration center, 2018. https://github.com/ctripcorp/apollo, Accessed on 2019-06-02.
[81] Andreas Prlić, Andrew Yates, Spencer E. Bliven, Peter W. Rose, Julius Jacobsen, Peter V. Troshin, Mark Chapman, Jianjiong Gao, Chuan Hock Koh, Sylvain Foisy, et al. BioJava: an open-source framework for bioinformatics in 2012. Bioinformatics, 28(20):2693–2695, 2012.
[82] Free chat-server: A chat server written in Java. https://sourceforge.net/projects/freecs, Accessed on 2019-09-10.
[83] Giraph: Large-scale graph processing on Hadoop. http://giraph.apache.org.
[84] H2 database engine. http://www.h2database.com, Accessed on 2019-06-02.
[85] Apache Jackrabbit is a fully conforming implementation of the content repository for Java technology API. http://jackrabbit.apache.org, Accessed on 2019-06-02.
[86] HyperSQL database. http://hsqldb.org, Accessed on 2019-06-02.
[87] Jetty: lightweight highly scalable Java based web server and servlet engine, 2018. https://www.eclipse.org/jetty, Accessed on 2019-06-02.
[88] Joda-Time: a quality replacement for the Java date and time classes. http://www.joda.org/joda-time, Accessed on 2019-06-02.
[89] Java Pathfinder. https://github.com/javapathfinder, Accessed on 2019-06-02.
[90] MyBatis: SQL mapper framework for Java. http://www.mybatis.org/mybatis-3, Accessed on 2019-06-02.
[91] OkHttp: An HTTP & HTTP/2 client for Android and Java applications. http://square.github.io/okhttp, Accessed on 2019-06-02.
[92] Adrian Smith. Universal Password Manager. http://upm.sourceforge.net, Accessed on 2019-06-02.
[93] Susi.ai - Software and Rules for Personal Assistants. http://susi.ai, Accessed on 2019-06-02.
[94] Stephen M. Blackburn, Robin Garner, Chris Hoffmann, Asjad M. Khang, Kathryn S. McKinley, Rotem Bentzur, Amer Diwan, Daniel Feinberg, Daniel Frampton, Samuel Z. Guyer, Martin Hirzel, Antony Hosking, Maria Jump, Han Lee, J. Eliot B. Moss, Aashish Phansalkar, Darko Stefanović, Thomas VanDrunen, Daniel von Dincklage, and Ben Wiedermann. The DaCapo Benchmarks: Java Benchmarking Development and Analysis. In Proceedings of the 21st Annual ACM SIGPLAN Conference on Object-oriented Programming Systems, Languages, and Applications, OOPSLA '06, pages 169–190, New York, NY, USA, 2006. ACM.
[95] Andrew Johnson, Lucas Waye, Scott Moore, and Stephen Chong. Exploring and Enforcing Security Guarantees via Program Dependence Graphs. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '15, pages 291–302, New York, NY, USA, 2015. ACM.
[96] Ugur Koc and Cemal Yilmaz. Approaches for computing test-case-aware covering arrays. Software Testing, Verification and Reliability, 28(7):e1689, 2018.
[97] Eibe Frank, M. A. Hall, and I. H. Witten. The WEKA workbench. Morgan Kaufmann, 2016.
[98] Microsoft gated graph neural networks. https://github.com/Microsoft/gated-graph-neural-network-samples, Accessed on 2018-09-02.
[99] Daniel Kroening and Michael Tautschnig. CBMC - C bounded model checker. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems, pages 389–391. Springer, 2014.
[100] Dirk Beyer and Matthias Dangl. Strategy Selection for Software Verification Based on Boolean Features. In Tiziana Margaria and Bernhard Steffen, editors, Leveraging Applications of Formal Methods, Verification and Validation. Verification, Lecture Notes in Computer Science, pages 144–159. Springer International Publishing, 2018.
[101] Dirk Beyer. Software verification with validation of results. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems, pages 331–349. Springer, 2017.
[102] Grigorios Tsoumakas and Ioannis Katakis. Multi-label classification: An overview. International Journal of Data Warehousing and Mining (IJDWM), 3(3):1–13, 2007.
[103] Scott Kirkpatrick, C. Daniel Gelatt, and Mario P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671–680, 1983.
[104] Peter J. M. Van Laarhoven and Emile H. L. Aarts. Simulated annealing. Springer, 1987.
[105] Vincent Granville, Mirko Křivánek, and J.-P. Rasson. Simulated annealing: A proof of convergence. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 16(6):652–656, 1994.
[106] The LLVM compiler infrastructure. https://llvm.org, Accessed on 2019-10-02.
[107] Ronald Aylmer Fisher. Design of experiments. Br Med J, 1(3923):554–554, 1936.
[108] John Sall, Mia L. Stephens, Ann Lehman, and Sheila Loring. JMP start statistics: a guide to statistics and data analysis using JMP. SAS Institute, 2017.
[109] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1):10–18, 2009.
[110] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[111] M. B. A. Snousy, H. M. El-Deeb, K. Badran, and I. A. A. Khlil. Suite of decision tree-based classification algorithms on cancer gene expression data. Egyptian Informatics Journal, 12(2):73–82, 2011.
[112] Changhai Nie and Hareton Leung. A survey of combinatorial testing. ACM Computing Surveys, 43:11:1–11:29, February 2011.
[113] C. Yilmaz, S. Fouché, M. B. Cohen, A. Porter, G. Demiroz, and U. Koc. Moving Forward with Combinatorial Interaction Testing. Computer, 47(2):37–45, February 2014.
[114] Mukund Raghothaman, Sulekha Kulkarni, Kihong Heo, and Mayur Naik. User-guided program reasoning using Bayesian inference. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2018, pages 722–735, New York, NY, USA, 2018. ACM.
[115] Abram Hindle, Earl T. Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu. On the Naturalness of Software. In Proceedings of the 34th International Conference on Software Engineering, pages 837–847, Piscataway, NJ, USA, 2012. IEEE Press.
[116] Martin White, Michele Tufano, Christopher Vendome, and Denys Poshyvanyk. Deep Learning Code Fragments for Code Clone Detection. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, ASE 2016, pages 87–98, New York, NY, USA, 2016. ACM.
[117] Xiaodong Gu, Hongyu Zhang, Dongmei Zhang, and Sunghun Kim. Deep API learning. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, pages 631–642. ACM, 2016.
[118] Jaroslav Fowkes and Charles Sutton. Parameter-free Probabilistic API Mining Across GitHub. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2016, pages 254–265, New York, NY, USA, 2016. ACM.
[119] Veselin Raychev, Martin Vechev, and Andreas Krause. Predicting Program Properties from "Big Code". In Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL '15, pages 111–124, New York, NY, USA, 2015. ACM.
[120] Miltiadis Allamanis, Earl T. Barr, Christian Bird, and Charles Sutton. Suggesting Accurate Method and Class Names. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2015, pages 38–49, New York, NY, USA, 2015. ACM.
[121] Zhaopeng Tu, Zhendong Su, and Premkumar Devanbu. On the Localness of Software. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014, pages 269–280, New York, NY, USA, 2014. ACM.
[122] Tung Thanh Nguyen, Anh Tuan Nguyen, Hoan Anh Nguyen, and Tien N. Nguyen. A Statistical Semantic Language Model for Source Code. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2013, pages 532–542, New York, NY, USA, 2013. ACM.
[123] Veselin Raychev, Martin Vechev, and Eran Yahav. Code Completion with Statistical Language Models. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '14, pages 419–428, New York, NY, USA, 2014. ACM.
[124] Miltiadis Allamanis, Hao Peng, and Charles Sutton. A Convolutional Attention Network for Extreme Summarization of Source Code. arXiv:1602.03001 [cs], February 2016.
[125] Xiaodong Gu, Hongyu Zhang, Dongmei Zhang, and Sunghun Kim. Deep API Learning. arXiv:1605.08535 [cs], May 2016.
[126] Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, and Charles Sutton. A survey of machine learning for big code and naturalness. ACM Comput. Surv., 51(4):81:1–81:37, July 2018.
[127] Cedric Richter and Heike Wehrheim. PeSCo: Predicting sequential combinations of verifiers. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems, pages 229–233. Springer, 2019.
[128] Yulia Demyanova, Thomas Pani, Helmut Veith, and Florian Zuleger. Empirical software metrics for benchmarking of verification tools. Formal Methods in System Design, 50(2):289–316, June 2017.
[129] Dirk Beyer and M. Erkan Keremoglu. CPAchecker: A tool for configurable software verification. In International Conference on Computer Aided Verification, pages 184–190. Springer, 2011.
[130] Kihong Heo, Hakjoo Oh, and Kwangkeun Yi. Machine-learning-guided Selectively Unsound Static Analysis. In Proceedings of the 39th International Conference on Software Engineering, ICSE '17, pages 519–529, Piscataway, NJ, USA, 2017. IEEE Press.
[131] T. D. LaToza and A. van der Hoek. Crowdsourcing in Software Engineering: Models, Motivations, and Challenges. IEEE Software, 33(1):74–80, January 2016.
[132] Richard Socher, Cliff C. Lin, Chris Manning, and Andrew Y. Ng. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 129–136, 2011.
[133] Duc Pham and Dervis Karaboga. Intelligent optimisation techniques: genetic algorithms, tabu search, simulated annealing and neural networks. Springer Science & Business Media, 2012.
[134] Edmund M. Clarke Jr., Orna Grumberg, Daniel Kroening, Doron Peled, and Helmut Veith. Model checking. MIT Press, 2018.
[135] Pierre Wolper. Expressing interesting properties of programs in propositional temporal logic. In POPL, volume 86, pages 184–193, 1986.
[136] Peter Padawitz.
Computing in Horn clause theories, volume 16. Springer Science & Business Media, 2012.