Improving Defect Localization by Classifying the Affected Asset using Machine Learning

Master's thesis in Software Engineering

SAM HALALI

© SAM HALALI, 2018.

Supervisor: Miroslaw Staron, Department of Computer Science and Engineering
Examiner: Jan-Philipp Steghöfer, Department of Computer Science and Engineering

Master's Thesis 2018
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Typeset in LaTeX
Gothenburg, Sweden 2018

Abstract

Today's market demands complex large-scale software to be developed and delivered at an increased pace. The increase in software complexity increases the cost of maintenance, which on average accounts for 60 percent of software costs. Corrective maintenance accounts for 21 percent of the maintenance costs and includes receiving a defect report describing a defect, then diagnosing and removing the described defect. A vital part of a defect's resolution is the task of defect localization: finding the exact location of the defect in the system. The defect report, in particular the asset attribute, helps the assigned entity limit the search space when investigating the exact location of the defect.
However, research has shown that oftentimes reporters initially assign values to these attributes that provide incorrect information. In this thesis, using machine learning to classify the source asset for a given defect report at a telecom company was evaluated. Following design science research, two iterations were conducted. The first iteration evaluated classification models for classifying the source asset after submission of a defect report. By training an SVM with features constructed from both categorical and textual attributes of the defect reports, an accuracy of 58.52% was achieved. The second iteration evaluated classification models for providing the reporter with recommendations of likely assets. By using recommendations provided by an SVM trained with features from both categorical and textual attributes of the defect reports, the precision could be significantly increased.

Keywords: Machine Learning, Defect Localization, Defect Predictions, Supervised Learning, Text Classification, Recommendation Systems

Acknowledgements

I would like to thank Miroslaw Staron, my supervisor, for providing valuable feedback and suggestions throughout the thesis process. I would also like to thank the telecom company at which this thesis was conducted. Finally, I would like to thank a long list of people providing me with motivation, help and joy. Thank you: Micael Caiman, Wilhelm Meding, Per Sundvall, Mahmoud Halali, Maria Ahmadikhatir, Emma Ahlberg, Fredrik Rahn, Simon Kindström, Marko Solunac, Patrik Olsson, Michaela Fritiofsson and Jenin Grill.

Sam Halali, Gothenburg, June 2018

Contents

List of Figures
List of Tables

1 Introduction
  1.1 Problem Background
  1.2 Problem Statement
  1.3 Purpose of the Study
  1.4 Limitations and Delimitations
2 Background
  2.1 Defects
  2.2 Machine Learning
    2.2.1 Naive Bayes
    2.2.2 Logistic Regression
    2.2.3 Decision Trees
    2.2.4 Support Vector Machines
    2.2.5 Artificial Neural Networks
    2.2.6 K-Nearest Neighbors
    2.2.7 Performance Evaluation of Classification Models
    2.2.8 Pitfalls in Supervised Learning
      2.2.8.1 Performance Validation of Classification Models
      2.2.8.2 Missing Attribute Values
      2.2.8.3 Class Imbalance Problem
      2.2.8.4 Overfitting
      2.2.8.5 Feature Engineering

3 Related Work
  3.1 Defect Assignment
  3.2 Duplicate Defect Report Detection
  3.3 Text Classification
  3.4 Defect Prediction In Industry

4 Research Design
  4.1 Research Questions
  4.2 Research Methodology
    4.2.1 Data set
    4.2.2 Development Setup
    4.2.3 Algorithms
    4.2.4 Feature engineering
    4.2.5 Iteration 1
    4.2.6 Iteration 2

5 Results
  5.1 Iteration 1
  5.2 Iteration 2

6 Discussion
  6.1 Iteration 1
  6.2 Iteration 2
  6.3 Future Work
    6.3.1 Feature Engineering
    6.3.2 Tuning Classification Algorithms
  6.4 Threats to Validity
    6.4.1 Conclusion Validity
    6.4.2 Internal Validity
    6.4.3 Construct Validity
    6.4.4 External Validity

7 Conclusion

List of Figures

2.1 The life cycle of a defect © 1993 IEEE
2.2 The relationship between defects and other entities in maintenance © 1993 IEEE
2.3 Example view of when reporting a defect.
2.4 The life cycle of a defect report at the telecom company
2.5 A plot of the function seen in Formula 2.11 and the data points seen in Table 2.4.
2.6 Example of a decision tree
2.7 Example of possible hyperplanes for the linearly separable data seen in Table 2.6.
2.8 An artificial neuron
2.9 Example of a complex overfitted model.

4.1 Design science research methodology as explained by Hevner et al. [46]
4.2 Classification scheme for Iteration 1
4.3 Classification scheme for Iteration 2

5.1 Resulting (a) recall@n and (b) precision@n for recommendation model trained with features constructed from both textual and categorical attributes of defect reports from Product 1
5.2 Resulting (a) recall@n and (b) precision@n for recommendation model trained with features constructed from both textual and categorical attributes of defect reports from Product 2

List of Tables

2.1 The attributes of a defect © 1993 IEEE
2.2 Description of the different variable types
2.3 Example of categorical defect report data with an unseen entry used for a Naive Bayes model.
2.4 Example of server metric data which are labeled with whether the server is overloaded or not.
2.5 Example data set used for training a decision tree model
2.6 Example of server metric data which are labeled with whether the server is overloaded or not.
2.7 Categorical defect report data with unseen entry x
2.8 The categorical defect report seen in Table 2.7 that has been encoded using the one-hot method.
2.9 Example of nominal defect data
2.10 Example of features constructed from nominal defect data from Table 2.9 using one-hot encoding
2.11 Example of human-written defect description.
2.12 Example of features constructed from human-written text using one-hot encoding
2.13 Example of features constructed from human-written text using a simple bag of words model

4.1 The mandatory attributes of the defect reports that were provided by the telecom company.
4.2 The number of possible assets for each product.
4.3 Descriptions of the evaluated classification algorithms.
4.4 Descriptions of the different setups of the DummyClassifier used in this thesis.
4.5 The benchmark values by setup for Product 1.
4.6 The benchmark values by setup for Product 2.
4.7 Example of recommendations provided by a system that recommends three assets for a given defect report.
4.8 The benchmark values of the recommendation system for each product.

5.1 Results for models trained with features constructed from the categorical attributes of defect reports from Product 1
5.2 Results for models trained with features constructed from the categorical attributes of defect reports from Product 2
5.3 Results for models trained with features constructed from the textual attributes of defect reports from Product 1
5.4 Results for models trained with features constructed from the textual attributes of defect reports from Product 2
5.5 Results for models trained with features constructed from both the textual and categorical attributes of defect reports from Product 1
5.6 Results for models trained with features constructed from both the textual and categorical attributes of defect reports from Product 2
5.7 Results for recommendation model trained with features constructed from categorical attributes of defect reports from Product 1
5.8 Results for recommendation model trained with features constructed from categorical attributes of defect reports from Product 2
5.9 Results for recommendation model trained with features constructed from textual attributes of defect reports from Product 1
5.10 Results for recommendation model trained with features constructed from textual attributes of defect reports from Product 2
5.11 Results for recommendation model trained with features constructed from both the textual and categorical attributes of defect reports from Product 1
5.12 Results for recommendation model trained with features constructed from both the textual and categorical attributes of defect reports from Product 2

6.1 The benchmark values by setup for Product 1.
6.2 The benchmark values by setup for Product 2.
6.3 The best performing classification model for each product.
6.4 The benchmark values of the recommendation system for each product.
6.5 The Recall@n of the best performing classifier of each product.
6.6 The Precision@n of the best performing classifier of each product.
6.7 The F1@n of the best performing classifier of each product.
6.8 The benchmark values of the recommendation system for each product.

1 Introduction

1.1 Problem Background

Today's market demands complex large-scale software to be developed and delivered at an increased pace.
The increase in software complexity increases the cost of maintenance [1], which on average accounts for 60 percent of software costs [2]. Corrective maintenance accounts for 21 percent of the maintenance costs and includes receiving a defect report, then diagnosing and removing the defect. In most cases, when a defect is found it is initially documented, assigned to a team and then submitted into a Defect Tracking System (DTS).

A defect report consists of several attributes that are vital during the life cycle of a defect. Among these attributes are the textual description of the defect, the asset attribute, which indicates the system part containing the defect, and the severity attribute, which indicates the highest impact the defect could or did cause. These attributes are utilized both when assigning the task of removing the defect to a team and when the assigned team investigates the location of the defect and the corrective measures for removing it.

A vital part of a defect's resolution is the task of defect localization: finding the exact location of the defect in the system. This task relies on the developer's expertise of the system and ability to identify and prioritize, during investigation, assets which may contain the defect [3]. The defect report, in particular the asset attribute, should help the assigned entity limit the search space when investigating the exact location of the defect. However, research has shown that oftentimes reporters initially assign values to these attributes that provide incorrect information [4][5].

A study conducted at Microsoft by Guo et al. [6] showed that oftentimes the reporter of the defect does not have the required expertise to identify the asset containing the defect, which is one of the main reasons behind the attribute being assigned an incorrect value.
The issue of incorrect values is also reflected in large-scale organizations other than Microsoft with many mission-critical parts, where a delay can have a big impact on cost and customer perception. Furthermore, in organizations that have dedicated development teams responsible for specific assets, the defect assignment process is intertwined with the task of defect localization [7]. In such organizations, an incorrect asset value may also delay the defect assignment.

1.2 Problem Statement

The problem addressed in this thesis is that the asset attribute of a defect report is often assigned an incorrect value. In large software organizations, it is common that the reporter of the defect does not have the required expertise to identify the asset containing the defect. As a result, the asset attribute is assigned an incorrect value, which unnecessarily delays defect resolution and has a large impact on maintenance costs, the speed of software development and the quality of the final product.

1.3 Purpose of the Study

The purpose of this thesis is to improve defect localization by classifying the correct asset from which a defect originates using machine learning. The intention is to build a classification model, based on the results of previous research, in an industrial context. In further detail, the aim is to reduce the defect localization time by providing a more accurate recommendation of which asset contains the defect. This will aid the assigned entity in limiting the search space when investigating the exact location of the defect.

1.4 Limitations and Delimitations

Like every study, this study is subject to limitations. One limitation is that defect reports of a telecom company will be used. The attributes of the defect report that will be considered are a subset of those specified by IEEE 1044 [8] that are recorded upon initial documentation of a defect.
These attributes include Description, Artifact, Severity and Detection Activity, and will be described in further detail in Chapter 2. The study will therefore only be reproducible given that the previously mentioned attributes exist in the defect reports used.

One delimitation is that only data from defect reports will be used. Other research on the topic has shown that data from other sources, such as version control systems, can be utilized for prediction [9]. These sources could contain useful features for classifying the asset, such as affected files and commit messages, but will not be considered in this study due to its time constraints.

2 Background

2.1 Defects

According to [8] a defect is: "An imperfection or deficiency in a work product where that work product does not meet its requirements or specifications and needs to be either repaired or replaced." When a defect is detected, its attributes are classified and documented in a defect report. The attributes of a defect can be seen in Table 2.1, where the names of the attributes might differ depending on the organization. Furthermore, the values of some attributes are added and changed over time as the organization addresses the defect.

Defect ID: Unique identifier for the defect.
Description: Description of what is missing, wrong, or unnecessary.
Status: Current state within the defect report life cycle.
Asset: The software asset (product, component, module, etc.) containing the defect.
Artifact: The specific software work product containing the defect.
Version detected: Identification of the software version in which the defect was detected.
Version corrected: Identification of the software version in which the defect was corrected.
Priority: Ranking for processing assigned by the organization responsible for the evaluation, resolution, and closure of the defect relative to other reported defects.
Severity: The highest failure impact that the defect could (or did) cause, as determined by (from the perspective of) the organization responsible for software engineering.
Probability: Probability of recurring failure caused by this defect.
Effect: The class of requirement that is impacted by a failure caused by a defect.
Type: A categorization based on the class of code within which the defect is found or the work product within which the defect is found.
Mode: A categorization based on whether the defect is due to incorrect implementation or representation, the addition of something that is not needed, or an omission.
Insertion activity: The activity during which the defect was injected/inserted (i.e., during which the artifact containing the defect originated).
Detection activity: The activity during which the defect was detected (i.e., inspection or testing).
Failure reference(s): Identifier of the failure(s) caused by the defect.
Change reference: Identifier of the corrective change request initiated to correct the defect.
Disposition: Final disposition of the defect report upon closure.

Table 2.1: The attributes of a defect © 1993 IEEE

The recorded attributes of a defect serve both the reporter and the receiver, who is responsible for removing the defect. Therefore, defect reports are usually stored in a database known as a Defect Tracking System (DTS). The DTS is used by employees of the organization, both engineers and managers, to understand the defect and follow its resolution process. When a defect has been documented and classified, it is assigned to an entity, for instance a development team, that is capable of removing the defect. Once corrective measures have been taken to resolve the defect, the changes are added to a planned release from which the defect is removed. The defect life cycle is depicted in Figure 2.1 and the relationship between defects and several conceptual entities is shown in Figure 2.2.
Figure 2.1: The life cycle of a defect © 1993 IEEE

Figure 2.2: The relationship between defects and other entities in maintenance © 1993 IEEE

At the telecom company where this study was conducted, defects are detected during several stages of the life cycle of its products. Defects can, for instance, be detected by end-users or during internal testing of the product. When a defect is detected by an end-user, the unit for customer support is contacted. Customer support investigates the reported problem and determines whether the problem occurred due to a defect in the product. If so, a defect report is created and submitted in the company's DTS. Regardless of who the reporter is, a web-portal is used to record the attributes of the defect and to describe it. The web-portal looks similar to the one shown in Figure 2.3. The attributes shown in Figure 2.3 are those that are mandatory to record before submitting the defect report: Artifact, Detection Activity, Severity, Asset and Description.

Figure 2.3: Example view of when reporting a defect.

The life cycle of a submitted defect report at the telecom company can be seen in Figure 2.4. When the defect report has been submitted, it is assigned to a team capable of removing the defect. The assignment process varies depending on what type of defect it is. Sometimes there are units responsible for certain assets of the product. These units receive all submitted defect reports regarding the asset they are responsible for and decide which team will be responsible for investigating and removing the defect.

Figure 2.4: The life cycle of a defect report at the telecom company

Once the defect report has been assigned, it is investigated. The ideal outcome of the investigation is that a corrective measure is suggested which leads to the removal of the defect. However, this is not always the case.
Oftentimes, the attributes of the defect are assigned incorrect values. If the incorrect values are detected by the assigned team, the team will suggest changes to the defect report, which can lead to a reassignment of the defect report.

2.2 Machine Learning

The goal of Machine Learning is to develop computational functions that learn through accumulated experience [10]. The two most common methods of learning are Unsupervised Learning and Supervised Learning.

In unsupervised learning, the goal is to learn structures of data sets without an identified output [11]. For instance, given a data set containing 100 defect reports originating from 5 different assets, the data needs to be grouped based on the assets without any explicit indicator of which asset each defect originates from. In the absence of labels, which are explicit indicators of what group each data point belongs to, the groups are learned by maximizing the similarities of the entries within a group and minimizing the similarities between the groups. One of the reasons behind the growth of this learning method is that, in certain problem domains, large unlabeled data sets are available but labeling each entry would be time-consuming and require domain expertise [12]. For instance, when developing a recommendation system for a news platform, a large number of articles would be available, but manually labeling each article as economy, sports or politics would be impractical.

The goal of supervised learning is to train a model to predict one or more outputs for unseen inputs, given that there is a functional relationship between the inputs and the outputs [13]. This is achieved by training the model with a labeled data set, which is a data set that contains the correct outputs for given inputs. According to Louridas et al. [14], this can be compared to giving a student a set of problems with their respective solutions for learning how to solve future unseen problems.
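The idea of training on labeled examples and then predicting labels for inputs can be sketched with a toy model that simply memorizes label counts per input. The data and attribute names below are illustrative only, not taken from the company's defect reports:

```python
from collections import Counter

# Toy labeled data set: (severity, detection activity) -> asset.
training_data = [
    (("A", "Customer"), "Login"),
    (("B", "Customer"), "Login"),
    (("A", "System Test"), "Validator"),
    (("B", "System Test"), "Validator"),
]

def train(data):
    """'Training' here just memorizes label counts per exact input."""
    model = {}
    for features, label in data:
        model.setdefault(features, Counter())[label] += 1
    return model

def predict(model, features):
    """Predict the most common label seen for these exact features."""
    counts = model.get(features)
    return counts.most_common(1)[0][0] if counts else None

model = train(training_data)
print(predict(model, ("A", "Customer")))  # -> Login
```

A real learner generalizes beyond exact matches, which is what the algorithms in the following sections do.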
The output variables to be predicted can be either categorical or numerical [13]. Categorical variables can be nominal, ordinal or dichotomous. Numerical variables can be discrete or continuous, and take on values obtained through measuring or counting. Descriptions of the different variable types can be seen in Table 2.2.

Nominal: A categorical variable that has more than two possible values without a natural order. Example: Asset = {UI, Login, Validator}
Ordinal: A categorical variable whose values have a natural order. Example: Severity = {A, B, C}
Dichotomous: Like a nominal variable, but with only two possible values. Example: State = {Open, Closed}
Discrete: A numerical variable that can take a value from a set of whole numbers. Example: #Lines Affected = {1, 2, 3, ...}
Continuous: A numerical variable that can take any value in a range of real numbers. Example: Time = {2018-04-06 04:01, ..., 2018-04-06 04:01:02, ...}

Table 2.2: Description of the different variable types

The task of predicting an output variable varies depending on the type of the variable. If the output is a numerical variable, as when trying to predict the number of defects reported per week, the problem is a regression problem. If the output is a categorical variable, as when trying to determine the severity of a defect, the problem is a classification problem. As the intention of this thesis is to build a model for determining the asset from which a defect originates, which is a nominal variable, the presented algorithms will be discussed in the context of classification. Furthermore, since all the provided defect data is labeled, only supervised learning algorithms will be presented.

2.2.1 Naive Bayes

Naive Bayes is a probabilistic classification algorithm that assigns an input the label that is most likely for that specific input [15].
The algorithm is based on Bayes' theorem, which can be seen in Equation 2.1, where P(A|B) is read as the probability of A given that event B has occurred.

P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)} = \frac{P(A \cap B)}{P(B)}    (2.1)

When classifying the unseen entry seen in Table 2.3, Bayes' theorem would be used as seen in Equation 2.2, where y is the classified label. The conditional probability P(Login | A ∩ Customer) would be calculated as seen in Equation 2.3.

Severity   Detection Activity   Asset
A          Customer             -
A          Customer             Login
A          System Test          Validator

Table 2.3: Example of categorical defect report data with an unseen entry used for a Naive Bayes model.

y = \arg\max \{ P(\text{Login} \mid A \cap \text{Customer}),\; P(\text{Validator} \mid A \cap \text{Customer}) \}    (2.2)

P(\text{Login} \mid A \cap \text{Customer}) = \frac{P(A \cap \text{Customer} \mid \text{Login}) P(\text{Login})}{P(A \cap \text{Customer})}    (2.3)

Since Naive Bayes makes the naive assumption that all inputs are independent, the calculation of the conditional probability is simplified as seen in Equation 2.4. Furthermore, the denominator is excluded from the calculation, since it is constant when comparing conditional probabilities for the same inputs.

P(\text{Login} \mid A \cap \text{Customer}) = \frac{P(A \mid \text{Login}) P(\text{Customer} \mid \text{Login}) P(\text{Login})}{P(A) P(\text{Customer})}    (2.4)

Now the classification task can be simplified as seen in Equation 2.5, where X is the vector of inputs \{x_1, \dots, x_n\} and C_k is the target variable with the possible values \{C_1, \dots, C_p\}.

y = \arg\max_{k \in \{1, \dots, p\}} P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)    (2.5)

A Naive Bayes model trained with the labeled data seen in Table 2.3 would classify the unseen entry as Login, because the entry has a greater likelihood of belonging to class Login, as seen in Equations 2.6 and 2.7.

P(\text{Login}) P(A \mid \text{Login}) P(\text{Customer} \mid \text{Login}) = 0.5 \cdot 1 \cdot 1 = 0.5    (2.6)

P(\text{Validator}) P(A \mid \text{Validator}) P(\text{Customer} \mid \text{Validator}) = 0.5 \cdot 1 \cdot 0 = 0    (2.7)

Naive Bayes offers fast training, implementation and classification [16][15].
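The computation in Equations 2.6 and 2.7 can be reproduced with a short sketch. It uses the two labeled rows of Table 2.3 (attribute names are illustrative) and applies no smoothing, so unseen attribute values yield a zero probability:

```python
# The two labeled rows of Table 2.3.
labeled = [
    ({"severity": "A", "activity": "Customer"}, "Login"),
    ({"severity": "A", "activity": "System Test"}, "Validator"),
]

def naive_bayes_score(x, label):
    """P(label) * product of per-attribute likelihoods (Equation 2.5)."""
    rows = [f for f, y in labeled if y == label]
    prior = len(rows) / len(labeled)
    likelihood = 1.0
    for attr, value in x.items():
        likelihood *= sum(f[attr] == value for f in rows) / len(rows)
    return prior * likelihood

x = {"severity": "A", "activity": "Customer"}   # the unseen entry
scores = {y: naive_bayes_score(x, y) for y in ("Login", "Validator")}
print(scores)                        # {'Login': 0.5, 'Validator': 0.0}
print(max(scores, key=scores.get))   # Login
```

In practice a smoothing term (e.g. Laplace smoothing) is added so that a single unseen value does not zero out an entire class.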
However, it usually has lower accuracy on interdependent data sets, compared to other classification algorithms, due to the assumption of independence amongst the features. On the other hand, the model offers interpretable insights into both the decision-making during classification and the knowledge gained during training.

2.2.2 Logistic Regression

Logistic regression is a classification method borrowed from the field of statistics. The method combines the logistic sigmoid function seen in Formula 2.8 with a linear model, classifying entries using the model seen in Formula 2.9 [17].

f(x) = \frac{1}{1 + e^{-x}}    (2.8)

P(x) = \frac{1}{1 + e^{-(\beta_0 x + \beta_1)}}    (2.9)

When predicting a binary target variable, the output of the function can be seen as P(x) = P(Y = A \mid X), where X is labeled A if P(X) > 0.5. During training, the slope \beta_0 and intercept \beta_1 seen in Formula 2.9 are decided by maximizing the likelihood function seen in Formula 2.10.

\ell(\beta_0, \beta_1) = \prod_{i: y_i = 1} p(x_i) \prod_{i': y_{i'} = 0} (1 - p(x_{i'}))    (2.10)

When using categorical variables, each unique value adds another dimension. Therefore, to provide an intuitive example, a different example than the one used for describing Naive Bayes is provided. Given the data seen in Table 2.4, the logistic regression model would learn the intercept and slope of the sigmoid function seen in Formula 2.11. A plot of the function together with the data used for training can be seen in Figure 2.5. When classifying the unseen entry x seen in Table 2.4, the model would label the entry, using the learned function, as Overload = NO, since P(2.7) ≈ 0.19 < 0.5.

P(x) = \frac{1}{1 + e^{-(2.8329x - 9.1050)}}    (2.11)

ID   Avg Sent (Mbit/s)   Overload
x    2.7                 -
1    0.5                 NO
2    1.5                 NO
3    2                   NO
4    4.5                 YES
5    5                   YES
6    3.5                 YES
7    3                   NO
8    3                   YES
9    3.5                 NO
10   4                   YES

Table 2.4: Example of server metric data which are labeled with whether the server is overloaded or not.

Figure 2.5: A plot of the function seen in Formula 2.11 and the data points seen in Table 2.4.
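Evaluating the fitted model from Formula 2.11 can be sketched as follows; the coefficients are those stated in the text, and the function names are placeholders:

```python
import math

def sigmoid(z):
    """Logistic sigmoid, Formula 2.8."""
    return 1.0 / (1.0 + math.exp(-z))

def p_overload(avg_sent_mbits):
    """Probability of Overload = YES per the fitted model (Formula 2.11)."""
    return sigmoid(2.8329 * avg_sent_mbits - 9.1050)

p = p_overload(2.7)
print(round(p, 2))                    # ~0.19
print("YES" if p > 0.5 else "NO")     # NO, matching the entry x example
```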
Similar to Naive Bayes, Logistic Regression provides interpretable classifications, since the output of the sigmoid function can be interpreted as a probability of class membership. Furthermore, Logistic Regression offers fast classification and makes no assumption of independence amongst the features. However, the algorithm is limited to linearly separable data, since the decision boundary is linear.

2.2.3 Decision Trees

Decision trees are a classification algorithm that uses a tree of logical decisions based on the attributes of the data to predict a label [15]. Much like in a flowchart, the data to be classified starts at the root node and traverses down the tree based on its attributes. At each branch of the tree there is a decision node, which indicates a decision to be made based on an attribute. The leaf node, also known as the terminal node, indicates the classified label of the data.

The data set shown in Table 2.5 consists of 4 entries containing categorical defect report data. Assume this data was used to train a decision tree model with the field Asset as the target variable. The resulting decision tree would look like Figure 2.6.

ID   Severity   Detection Activity   Asset
1    A          Customer             Login
2    B          System Test          Validator
3    B          Customer             Login
4    A          System Test          Load Balancer

Table 2.5: Example data set used for training a decision tree model

Figure 2.6: Example of a decision tree

One of the main advantages of using decision trees is the interpretability of the classifications [18]. Assume the decision tree is used to classify the entry with ID = 3; the model would classify the asset as Login. If someone were to ask why the asset is classified as such, the answer would be "because the Detection Activity is Customer". This type of insight into the classification is not common amongst classification algorithms and is viewed as one of decision trees' strong suits.
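A tree consistent with the Table 2.5 data can be hand-coded as nested conditions. This is a sketch of one plausible tree (a learned tree could split on the attributes in a different order):

```python
# Root splits on Detection Activity; the System Test branch then
# splits on Severity. Consistent with all four rows of Table 2.5.
def classify(severity, detection_activity):
    if detection_activity == "Customer":
        return "Login"
    if severity == "B":              # System Test branch
        return "Validator"
    return "Load Balancer"

print(classify("B", "Customer"))     # Login (the entry with ID = 3)
print(classify("A", "System Test"))  # Load Balancer
```

The nested-conditions form also makes the interpretability argument concrete: the answer to "why Login?" is literally the first branch taken.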
Furthermore, decision trees have a fast classification phase, since an unseen entry only traverses down the tree, along the path decided by the decision nodes, until a terminal node is reached [15]. However, decision trees are sensitive to both noise and redundant features [16], which can lead to overly complex trees.

2.2.4 Support Vector Machines

A Support Vector Machine (SVM) is a classification method that constructs a linear boundary, called a hyperplane, that separates the entries by their labels into two partitions [15][19]. The class of an unseen entry is decided based on which partition the entry is located in.

As when using Logistic Regression, categorical inputs add another dimension for each unique value of the variable. Therefore, the problem of classifying whether a server is overloaded is used as an example. The training data and the unseen entry x can be seen in Table 2.6. Since the data is linearly separable, a SVM would identify an infinite number of possible hyperplanes during training. Three of the possible hyperplanes, denoted A, B and C, are shown in Figure 2.7.

ID   #Users (*1000)   Avg Sent (Mbit/s)   Overload
x    1.5              5                   -
1    2                1                   NO
2    1                2.5                 NO
3    3                4                   YES
4    4                3.5                 YES
5    4                5                   YES
6    1                1.5                 NO
7    2                3                   NO
8    3                3.5                 YES
9    1.5              2                   NO
10   3.5              4.5                 YES

Table 2.6: Example of server metric data labeled with whether the server is overloaded or not.

Figure 2.7: Example of possible hyperplanes for the linearly separable data seen in Table 2.6.

The problem of deciding between different possible hyperplanes is solved by searching for the Maximum Margin Hyperplane (MMH). The MMH is the hyperplane that creates the greatest separation between the two classes and is most likely to generalize best to future unseen data [15]. Furthermore, the hyperplane is represented by a set of support vectors, which are the points that lie on the hyperplane's maximum margin.
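A linear decision function of this kind can be sketched directly. The weights below are a hand-picked separating hyperplane for the data in Table 2.6, purely for illustration; a trained SVM would instead select the maximum margin hyperplane among the infinite candidates.

```python
# Illustrative linear decision function f(x) = w.x + b; the weights are
# hand-picked to separate the data in Table 2.6, not the actual MMH.
w = (1.0, 1.0)
b = -5.75

def classify(users_thousands, avg_sent_mbit):
    score = w[0] * users_thousands + w[1] * avg_sent_mbit + b
    return "YES" if score > 0 else "NO"

# The ten training entries from Table 2.6:
train = [((2, 1), "NO"), ((1, 2.5), "NO"), ((3, 4), "YES"), ((4, 3.5), "YES"),
         ((4, 5), "YES"), ((1, 1.5), "NO"), ((2, 3), "NO"), ((3, 3.5), "YES"),
         ((1.5, 2), "NO"), ((3.5, 4.5), "YES")]
all_correct = all(classify(*xy) == label for xy, label in train)

prediction = classify(1.5, 5.0)   # the unseen entry x -> "YES"
```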
Given the hyperplanes provided in Figure 2.7, the SVM would determine that B is the MMH and use it as the boundary between the partitions. The coordinates of the unseen entry x, seen in Table 2.6, would be (1.5, 5), which is above the partition boundary B. Therefore the entry would be labeled as Overload = YES.

SVMs can also be applied to data that is not linearly separable. This can be done by using either a slack variable or the kernel trick. The slack variable allows misclassification of entries at a cost C per entry; instead of searching for the MMH, the goal of the algorithm is then to minimize the total cost. The kernel trick is the process of using kernel functions to transform the data into a higher-dimensional space, so that the algorithm views the data as linearly separable. The kernel function adds new features based on mathematical relationships between the existing features.

Overall, SVMs have been used in a wide variety of domains where they have provided highly accurate models. Furthermore, the classification phase of SVMs is considered to be fast [16][15], because the label of an unseen entry is determined by its relation to the MMH. However, finding the MMH comes at a great computational cost, which slows down the training phase significantly.

2.2.5 Artificial Neural Networks

Artificial Neural Networks (ANNs) originate from attempts to represent the human brain as a mathematical model [20][21][15] and are used to solve a variety of problems in both supervised and unsupervised learning. The artificial neuron seen in Figure 2.8 behaves much like a biological neuron. The neuron receives a set of input signals, denoted x_i, which are multiplied by their respective weights w_i and summed to a single value. The total is then used by the activation function, denoted f, which produces the signal y(x). The formal definition of an artificial neuron can be seen in Formula 2.12.
Figure 2.8: An artificial neuron

y(x) = f( Σ_{i=1}^{n} w_i · x_i )    (2.12)

The main characteristics of an ANN are the activation function, the network topology and the training algorithm. The activation function calculates the output of the neuron, which is forwarded to the other connected neurons. The simplest activation function is the threshold activation function. A typical threshold activation function outputs a signal when the sum of the inputs reaches a certain threshold and does nothing otherwise. However, because of the discrete nature of the threshold activation function, the most common activation function is the logistic sigmoid activation function, which has a similar shape but is continuous.

The network topology of an ANN is defined by the number of layers, the allowed directions of data flow and the number of nodes in each layer. The ability of an ANN to find subtle patterns in complex data sets is generally determined by the size of the network and the composition of nodes. In ANNs, data can be allowed to flow backwards, which can enable the network to learn patterns in sequences of events over time. The weights of the inputs are determined during the training phase with the use of a training algorithm, the most common one being backpropagation.

2.2.6 K-Nearest Neighbors

Given a training data set, the Nearest Neighbor algorithm assigns an unseen entry x the same label as the training entry that is nearest given a distance metric d [15][11]. There are several distance metrics used for determining the nearest neighbor, such as the Manhattan distance and the Euclidean distance for continuous features, and the Hamming distance [22] for nominal features. To avoid misclassification in noisy domains, more than one of the nearest neighbors are used for deciding the label, which is why the algorithm is most commonly known as k-Nearest Neighbors (kNN).
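A minimal kNN sketch over nominal attributes, using the Hamming distance on their one-hot encodings; the data follows the defect-report example in Table 2.7.

```python
from collections import Counter

# Training data following Table 2.7: (Severity, Detection Activity) -> Asset.
train = [
    (("A", "Customer"), "Login"),
    (("B", "System Test"), "Validator"),
    (("B", "Customer"), "Login"),
    (("A", "System Test"), "Load Balancer"),
    (("C", "Function Test"), "Input Parser"),
]

def hamming(a, b):
    """Hamming distance over one-hot encodings: each differing nominal
    attribute flips two bits (one off, one on), so it contributes 2."""
    return sum(2 for u, v in zip(a, b) if u != v)

def knn_classify(entry, k=3):
    neighbors = sorted(train, key=lambda pair: hamming(entry, pair[0]))[:k]
    labels = [label for _, label in neighbors]
    return Counter(labels).most_common(1)[0][0]   # majority vote

# The unseen entry x: Severity = C, Detection Activity = Customer.
prediction = knn_classify(("C", "Customer"))      # -> "Login"
```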
When classifying the label of the entry x seen in Table 2.7, given the training examples below x, the algorithm would:

1. Calculate the distance (here, the Hamming distance) between x and the training examples.
2. Identify the k entries that are closest to x.
3. Identify the most common label c amongst the k entries.
4. Label x as c.

Assume k = 3. Given the data in Table 2.7, encoded as seen in Table 2.8, the algorithm would identify the entries 1, 3 and 5 as the k nearest neighbors. The most common label amongst the nearest neighbors is Login, therefore the entry x would be labeled as Login.

ID   Severity   Detection Activity   Asset
x    C          Customer             -
1    A          Customer             Login
2    B          System Test          Validator
3    B          Customer             Login
4    A          System Test          Load Balancer
5    C          Function Test        Input Parser

Table 2.7: Categorical defect report data with unseen entry x

ID   A   B   C   Customer   System Test   Function Test   Asset           Hamming Distance
x    0   0   1   1          0             0               -               -
1    1   0   0   1          0             0               Login           2
2    0   1   0   0          1             0               Validator       4
3    0   1   0   1          0             0               Login           2
4    1   0   0   0          1             0               Load Balancer   4
5    0   0   1   0          0             1               Input Parser    2

Table 2.8: The categorical defect report data seen in Table 2.7, encoded using the one-hot method.

The k-Nearest Neighbors algorithm has a fast training phase but is slow when classifying new entries [15][16]. During the training phase the algorithm does not learn from the training data; instead it stores the data in memory, which makes the training phase inherently fast. This is undesirable when dealing with large amounts of data, since the storage requirement scales with the number of training examples. The slow classification of an unseen entry is due to the great computational cost of calculating the distance between the entry and all training examples, and of finding the k examples with the shortest distance to the entry.

2.2.7 Performance Evaluation of Classification Models

There are several ways of evaluating the performance of a trained classifier.
The most common measures of performance are the error rate and the accuracy [18]. The accuracy of a trained classifier is derived by dividing the number of correctly labeled entities by the total number of labeled entities. The error rate is derived by dividing the number of incorrectly labeled entities by the total number of labeled entities. These metrics work well for balanced data sets; however, they are not accurate measures of performance for unbalanced data sets. Assume that we are predicting an output variable X that can take on two values, A and B. The distribution of the data is unbalanced: 98% of the values are labeled A and 2% are labeled B. A classifier that always predicts A would have an accuracy of 98%, yet most people would agree that this classifier is of no use.

If the class distribution is unbalanced, the accuracy metric needs to be complemented or substituted by other metrics, such as precision and recall. Precision is calculated by dividing the number of correct predictions of a class by the number of times the class was predicted [18]. The precision formula for a class can be seen in Formula 2.13, where true positives are denoted TP and false positives FP. Furthermore, the precision of a class A can be interpreted as the probability of the classifier being correct when labeling an unseen entry as A.

Precision = Pr = TP / (TP + FP)    (2.13)

Consider a data set with 98 entries labeled B and 2 entries labeled A, and a classifier that always labels entries as B, except for one correct prediction of A. The precision for class A would then be 1. Therefore the precision measure is complemented with the recall metric, which measures the probability of the classifier recognizing the class A when given an instance of the class A. Recall is calculated by dividing the number of correct predictions of the class by the number of instances of the class [18].
The formula of the recall metric can be seen in Formula 2.14, where false negatives are denoted FN. The measures recall and precision can be combined into a single metric called the F1-score, which is the harmonic mean of the two measures. The formula of the F1-score can be seen in Formula 2.15.

Recall = Re = TP / (TP + FN)    (2.14)

F1 = (2 · Pr · Re) / (Pr + Re)    (2.15)

To aggregate a metric that has been measured for all classes into a single value, two different methods of averaging can be used, namely micro-averaging and macro-averaging. The micro-average of a metric is calculated by dividing the sum of the numerators by the sum of the denominators of the measurements for each class. An example of micro-averaging the precision metric for the classes A and B can be seen in Formula 2.16, given the measures seen in Formula 2.17. This averaging method is biased towards the most populated class.

Pr = (TP_A + TP_B) / (TP_A + FP_A + TP_B + FP_B)    (2.16)

Pr_A = TP_A / (TP_A + FP_A)        Pr_B = TP_B / (TP_B + FP_B)    (2.17)

The macro-average is calculated by dividing the sum of the class metrics by the number of classes. Given the measures seen in Formula 2.17, the macro-average of the precision metric for the classes A and B would be expressed as seen in Formula 2.18. This averaging method is useful for unbalanced data sets since it weighs each class equally.

Pr = (Pr_A + Pr_B) / 2    (2.18)

Another metric that has been used to evaluate the performance of classification models on unbalanced data sets is the Matthews Correlation Coefficient (MCC) [23][24]. The metric's value ranges between -1 and 1 and indicates the degree of correlation between the predictions and the actual results, where 1 indicates complete correlation, 0 indicates no correlation and -1 indicates negative correlation. The formula for the Matthews Correlation Coefficient can be seen in Formula 2.19.
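The precision, recall, F1 and averaging computations above can be worked through in code. This sketch reuses the unbalanced example from the text: 98 entries of class B, 2 of class A, and a classifier that predicts B for everything except one correct prediction of A.

```python
# Counts for the worked example: one A found, none falsely claimed, one missed;
# all B found, one A mislabeled as B.
TP_A, FP_A, FN_A = 1, 0, 1
TP_B, FP_B, FN_B = 98, 1, 0

def precision(tp, fp):   # Formula 2.13
    return tp / (tp + fp)

def recall(tp, fn):      # Formula 2.14
    return tp / (tp + fn)

def f1(pr, re):          # Formula 2.15
    return 2 * pr * re / (pr + re)

pr_A, re_A = precision(TP_A, FP_A), recall(TP_A, FN_A)   # 1.0, 0.5
pr_B, re_B = precision(TP_B, FP_B), recall(TP_B, FN_B)   # 98/99, 1.0
f1_A = f1(pr_A, re_A)                                    # 2/3

# Micro-average (Formula 2.16) pools the counts and is dominated by class B;
# the macro-average (Formula 2.18) weighs both classes equally.
micro_pr = (TP_A + TP_B) / (TP_A + FP_A + TP_B + FP_B)   # 0.99
macro_pr = (pr_A + pr_B) / 2
```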
MCC = (TP · TN − FP · FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))    (2.19)

Furthermore, the Matthews Correlation Coefficient has been generalized to cover cases where the target variable is non-binary. Given a K × K confusion matrix C, as seen in Formula 2.20, where the rows indicate the predicted class and the columns indicate the actual class, the metric can be formulated as seen in Formula 2.21.

C = [ C_11  C_12  ...  ]
    [ ...   ...   ...  ]    (2.20)
    [ C_K1  ...   C_KK ]

MCC = Σ_k Σ_l Σ_m (C_kk · C_lm − C_kl · C_mk)
      / ( √( Σ_k (Σ_l C_kl)(Σ_{k'≠k} Σ_{l'} C_{k'l'}) ) · √( Σ_k (Σ_l C_lk)(Σ_{k'≠k} Σ_{l'} C_{l'k'}) ) )    (2.21)

2.2.8 Pitfalls in Supervised Learning

When developing a classification model there are a number of pitfalls that need to be taken into consideration. This section describes these pitfalls and how to avoid them.

2.2.8.1 Performance Validation of Classification Models

As the goal of a classification model is to adequately classify a target variable for unseen entries, the classifier should be properly evaluated. Other than deciding appropriate performance measures for the problem domain, it is important to consider how the measures translate beyond the training data. Evaluating the model on the training data is misleading, since performing well on the training data is easy [25]. For instance, a classifier that memorizes the training data will have an accuracy of 100% when evaluated on the training data. Therefore, it is important to evaluate the model on data that has not been used for training it. This is done by splitting the data set into a training set and a test set, so that the model is evaluated on unseen data.

To avoid the influence of the selection of the test set, k-Fold Cross-Validation is used. In k-Fold Cross-Validation the data is split into k subsets of the same size, which are used by the learning algorithm k times. Let us denote the set of subsets {x1, x2, ..., xk} as X.
For each run, from i = 1 to i = k, the algorithm uses all subsets {x_j ∈ X | j ≠ i} for training and x_i for testing. The final score, used for evaluating the model, is the average of all test scores. Furthermore, even if the data set shares the same class distribution as the data encountered in the problem domain, randomly splitting the data can yield a pessimistic performance estimate [26]. In such cases, each fold can be stratified to maintain the same class distribution as the entire data set. For instance, when splitting a data set that contains 90 instances of class A and 10 instances of class B into 10 stratified folds, each fold would contain 9 instances of class A and 1 instance of class B. Using stratified folds with k-Fold Cross-Validation is known as Stratified k-Fold Cross-Validation.

2.2.8.2 Missing Attribute Values

In many data sets the values of some attributes are missing, which can cause problems for certain classification algorithms. For instance, given an entry with one or more missing values, an artificial neuron will fail to calculate the input of the activation function, Σ_i w_i x_i. The entries with missing attribute values can be removed, but for some data sets this would mean removing a majority of the data [18]. The other option is to fill in the missing values, which can be done using different strategies [11].

One strategy is to decide the value based on the observed values of the attribute. This can be done by choosing the most frequent value, the average of all values, or a random value drawn from the attribute's distribution. However, this can be misleading, especially in cases where the attributes are dependent. For instance, consider a defect report with the values Status = Open, where Status ∈ {Open, Closed}, and Action = ?, where Action ∈ {Pending, Fixed}. Setting the Action attribute to Fixed would confuse the model, as this data point would be considered noise.
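The most-frequent-value strategy can be sketched as follows; the attribute values are illustrative, not taken from the thesis data set.

```python
from collections import Counter

# Fill missing nominal values with the most frequent observed value
# (one of the strategies described above). None marks a missing value.
detection_activity = ["Customer", "System Test", None, "Customer", None]

observed = [v for v in detection_activity if v is not None]
most_frequent = Counter(observed).most_common(1)[0][0]   # "Customer"
filled = [v if v is not None else most_frequent for v in detection_activity]
```

For a continuous attribute, the same pattern applies with the mean of the observed values in place of the mode.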
Another strategy is to train a model to determine the values of the examples with missing attribute values. This is done by training a model on all examples that have values for the attribute of interest. When the model has been trained, the values of the examples with missing values are determined by the model. This adds another layer of complexity, since each attribute with missing values adds a new machine learning problem.

2.2.8.3 Class Imbalance Problem

A data set where the distribution of classes is unbalanced is said to suffer from the Class Imbalance Problem [11]. For instance, in a training set for classifying malicious network requests, the class distribution will most likely be unbalanced. As the imbalance of the class distribution grows, the accuracy when classifying the minority class decreases, because of the model's bias towards the majority class [18]. Given a classification task where the goal is to maximize the accuracy, and where the provided training and test data share the same class distribution, the class imbalance problem is not deemed meaningful [11]. However, when classifying malicious network requests, the minority class is the class of interest. Therefore the model needs to be evaluated with other metrics that highlight the classification performance on the minority classes, such as MCC or the F1-score.

Substituting the accuracy metric with MCC or the F1-score only helps to properly evaluate the model and detect the model's biases towards different classes. To mitigate the class imbalance problem itself, the training data needs to be re-sampled. There are two common methods for re-sampling the training data, namely undersampling and oversampling [18].

When undersampling the training data, examples from the majority class are removed from the data set until the class distribution is balanced. The examples to remove can be selected at random or by identifying noisy instances of the majority class.
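Random undersampling can be sketched in a few lines; the 90/10 label distribution below is illustrative.

```python
import random

# Random undersampling: drop majority-class examples until the class
# distribution is balanced. Labels are illustrative, not thesis data.
random.seed(0)  # deterministic for the example

labels = ["A"] * 90 + ["B"] * 10          # 90/10 imbalance, A is the majority
majority = [i for i, y in enumerate(labels) if y == "A"]
minority = [i for i, y in enumerate(labels) if y == "B"]

kept = random.sample(majority, len(minority))   # keep as many A as there are B
balanced_indices = sorted(kept + minority)      # 10 of each class remain
```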
However, in some cases the training set is small, and undersampling it is therefore impractical. In such cases oversampling is used, where more examples from the minority class are added to the training set to balance the class distribution. The new examples are created by copying existing examples from the minority class. The copies are then either kept as-is, or minor modifications are made to the continuous attributes of each copy.

2.2.8.4 Overfitting

Overfitting is when the model fits the training set well but fails to generalize to unseen data [27]. This is due to the model trying to describe all the data points rather than the underlying distribution of the data [11], as seen in Figure 2.9. A common approach for avoiding overfitting is regularization.

Figure 2.9: Example of a complex overfitted model.

In flexible algorithms such as Logistic Regression and SVMs, the parameters that maximize the training score are selected during training. This favours overly complex models, as seen in Figure 2.9. To penalize complex models, regularization is used during training. Regularization is the method of using a regularizer, which quantifies the complexity of a model, together with the performance measures when selecting a model.

2.2.8.5 Feature Engineering

Feature engineering is the process of constructing, extracting and selecting features. Features are characteristics of an object that are representations of, or calculations made with, the object's attributes. Consider the data seen in Table 2.9, where the nominal attribute Detection Activity is used as a feature without any modifications. Training a SVM with this data would not be possible, since the inputs cannot be placed in a numerical space. Therefore the attributes of an object have to be represented in a way that can be interpreted by the learning algorithm.
ID   Detection Activity   Asset
1    Customer             Login
2    Function Test        Validator

Table 2.9: Example of nominal defect data

A common way of constructing features from nominal attributes is one-hot encoding. One-hot encoding creates a new dichotomous feature for each possible value of the attribute, where the value 1 indicates the presence of that value. One-hot encoding the Detection Activity attribute yields the features seen in Table 2.10, which can now be used for training a SVM.

ID   Function Test   Customer   Asset
1    0               1          Login
2    1               0          Validator

Table 2.10: Example of features constructed from the nominal defect data in Table 2.9 using one-hot encoding

Now consider the data seen in Table 2.11, where the attribute Description contains human-written text. Treating the attribute as a nominal attribute and constructing features from it using one-hot encoding would yield the data seen in Table 2.12. When calculating the Hamming distance between the unseen entry x and the other entries, the resulting distances would all be the same. This is because the similarities between the texts are not quantified; either the texts are identical or they are not. Therefore text is usually represented as a bag of words [11], which is a feature vector that describes the occurrence of each word in the text.

ID   Description                   Asset
x    Login fails                   -
1    Login not working             Login
2    Validate function incorrect   Validator

Table 2.11: Example of human-written defect descriptions.

ID   Login not working   Validate function incorrect   Asset
x    0                   0                             -
1    1                   0                             Login
2    0                   1                             Validator

Table 2.12: Example of features constructed from human-written text using one-hot encoding

A simple bag of words representation is constructed by first creating a vocabulary. The vocabulary is constructed by collecting the words of all entries used for training into a set. The set will then contain all the distinct words found in all documents.
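The vocabulary construction and word counting can be sketched as follows, using the training descriptions of Table 2.11 (words are lowercased here for simplicity, an assumption not made explicit in the text):

```python
# Simple bag of words: build a vocabulary from the training descriptions
# of Table 2.11, then count word occurrences per text.
train_corpus = ["Login not working", "Validate function incorrect"]

# The vocabulary is the set of distinct words across the training documents.
vocabulary = sorted({w.lower() for text in train_corpus for w in text.split()})

def bag_of_words(text):
    """Count occurrences of each vocabulary word; out-of-vocabulary words
    (such as 'fails' in the unseen entry) are simply ignored."""
    words = text.lower().split()
    return [words.count(w) for w in vocabulary]

vec_x = bag_of_words("Login fails")        # the unseen entry x
vec_1 = bag_of_words("Login not working")  # entry 1, sharing the word "login"
```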
After the vocabulary has been constructed, the number of occurrences of each word in the vocabulary is counted for each entry. The resulting vectors are then used as representations of the documents. Applying this technique to the data seen in Table 2.11 yields the features seen in Table 2.13. Now, when calculating the Hamming distance, the closest entry to the unseen entry x is entry 1.

ID   Login   not   working   Validate   function   incorrect   Asset
x    1       0     0         0          0          0           -
1    1       1     1         0          0          0           Login
2    0       0     0         1          1          1           Validator

Table 2.13: Example of features constructed from human-written text using a simple bag of words model

To highlight the words that distinguish each textual entry, also known as a document, a method called Term Frequency - Inverse Document Frequency (TF-IDF) can be used. Similar to the simple bag of words model, features are constructed by first creating a vocabulary. After the vocabulary has been constructed, the frequency of each word in the vocabulary is calculated for each document. The term frequency is calculated using the function seen in Formula 2.22 for each word and document, where w denotes a word and d a document.

tf(w, d) = (# occurrences of w in d) / (# words in d)    (2.22)

To weight each word based on its occurrence across all documents, the inverse document frequency is calculated for each word, using the logarithmic function seen in Formula 2.23. Through the logarithm, words that occur frequently across all documents receive a weight close to zero, while words that occur rarely receive a higher weight.

idf(w) = log( (# documents) / (# documents containing w) )    (2.23)

Each position of the final feature vector, which corresponds to a word in the vocabulary, is derived by multiplying the term frequency with the inverse document frequency for each document. The complete function for TF-IDF can be seen in Formula 2.24.
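The two factors and their product can be sketched over the training descriptions of Table 2.11; the lowercasing is an assumption for simplicity.

```python
import math

# TF-IDF sketch: tf weighs a word within one document; idf (with the
# logarithm) down-weights words that occur across many documents.
documents = ["Login not working", "Validate function incorrect"]

def tf(word, doc):
    words = doc.lower().split()
    return words.count(word) / len(words)

def idf(word):
    containing = sum(1 for d in documents if word in d.lower().split())
    return math.log(len(documents) / containing) if containing else 0.0

def tf_idf(word, doc):
    return tf(word, doc) * idf(word)

# "login" occurs in 1 of the 2 documents: tf = 1/3, idf = log(2).
weight = tf_idf("login", "Login not working")
```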
tf-idf(w, d) = tf(w, d) · idf(w)    (2.24)

3 Related Work

No previous research has been conducted on predicting the asset attribute of a defect report. However, using machine learning to predict other defect attributes and to improve the process of defect resolution is not uncommon. The systematic mapping study conducted by Cavalcanti et al. [28] reported that most research on the topic of defect classification is centered on defect assignment and duplicate detection. Furthermore, Cavalcanti et al. noted that previous research could be extended to classify other defect attributes that are recorded manually, such as asset and severity. This could be helpful for less experienced reporters when reporting a defect.

3.1 Defect Assignment

Automating the task of defect assignment is a well-researched topic. In a systematic review [29], 75 papers researching the topic were reviewed, with machine learning being one of the most common approaches to automating defect assignment. Previous studies have evaluated several classification techniques, with the most common evaluation metric being accuracy [29]. The reported accuracies of individual classifiers range from 25% [30] to 64% [31], with higher accuracies being reported with the use of meta-algorithms [32][7]. The most common features are constructed from the descriptions of defect reports using TF-IDF [29]. However, categorical attributes such as Asset [31][30][33][32][34], Artifact [30][31], Submitter [35][34] and Detection Activity [34] are also used.

When evaluating their classification model, Banitaan et al. [33] reported that the features constructed from the asset attribute were the most influential features for three out of the four data sets used. Similar findings were reported by Anvik et al. [31], who improved the accuracy of their classification model for Eclipse by 48% when using features constructed from the asset attribute.
However, both papers validated the performance of their classification models using resolved defect reports. Therefore, the effect of the asset attribute being assigned an incorrect value was never observed.

3.2 Duplicate Defect Report Detection

A duplicate defect report describes a defect that has already been reported and submitted into the DTS. Duplicate defect reports are unwanted in the DTS, since investigating them inherently means that two different assigned entities are doing the same work. Therefore, organizations often have designated staff investigating the incoming defect reports for duplicates [28].

To automate the process of detecting duplicate defect reports, several methods have been proposed that utilize the description of the defect report to measure the similarity between defect reports [36][37][38][39]. By using TF-IDF to construct features from the defect description, Jalbert et al. [37] developed a model that managed to filter out 8% of the duplicate defect reports before they reached the DTS. With a detection rate of 8%, the model could not replace the manual labour required for detecting duplicates. However, the model could reduce the number of defect reports to be inspected manually. Since the model's detection rate of non-duplicates was 100%, the only cost of deploying the model would be the cost of classifying each defect report. The time required for classifying a defect report was reported to be 20 seconds.

Tian et al. [39] developed a classification model with a detection rate of 24% by also including features constructed from the categorical attributes of the defect reports. However, the improvement of the detection rate led to a loss of 9% in the non-duplicate detection rate.

3.3 Text Classification

Text classification is the task of organizing unlabeled documents into categories. In a paper about text classification with SVMs, Joachims [40] identified some of the challenges of text classification.
Even when the stop words of a text corpus are removed, the number of remaining words is still considerable. This high-dimensional input space can lead to problems with generalizability for some algorithms. A technique for reducing the dimensionality of the input space is to remove features that are irrelevant. However, Joachims [40] noted that few of the features constructed from the words of a corpus are irrelevant, and that removing features therefore results in information loss.

The algorithms discussed in the Background chapter, namely Naive Bayes, Logistic Regression, Decision Trees, Support Vector Machines, Artificial Neural Networks and k-Nearest Neighbors, can all be used for text classification [41]. However, according to Joachims [40], most text classification problems are linearly separable, which benefits algorithms that use linear functions to separate the categories. Therefore, linear SVMs, Naive Bayes and Logistic Regression are common algorithms for text classification. However, as stated by the No Free Lunch Theorem [11], there is no single algorithm that works best for every problem.

3.4 Defect Prediction In Industry

Prediction models in the domain of software defects are widely researched, although most of the previously presented studies concern defects in open source software. However, research has also been conducted in the domain of large-scale industrial software projects. Among the problems studied in this domain are Defect Inflow Prediction [42][43] and Software Defect Prediction [44].

Defect Inflow Prediction is the task of predicting the number of non-redundant defects being reported into the DTS [43]. In a study conducted by Staron et al. [43], a model was proposed for predicting the defect inflow during the planning phase up to 3 weeks in advance. The model was constructed using multivariate linear regression and modeled the defect inflow as a function of characteristics of work packages.
The results showed that the model could support project managers in estimating the work effort needed for completing the project, by providing a defect inflow prediction accuracy of 72%.

To make the testing phase more efficient, a Software Defect Prediction (SDP) model can be used. SDP is the task of predicting which software assets are prone to defects. By using an SDP model, organizations can make testing more efficient by allocating more resources to the predicted assets [44]. Predicting assets prone to defects can be done using machine learning. Rana et al. [44] highlight that the problem can be framed either as a classification or as a regression problem. A classification model for SDP classifies modules, represented by software metrics and code attributes, as fault-prone or non-fault-prone based on previous projects [45]. For this task, several algorithms, including Naive Bayes, Logistic Regression, Decision Trees, Support Vector Machines, Artificial Neural Networks and k-Nearest Neighbors, have been used and have shown significant results. However, the comparative study conducted by Lessmann et al. [45] concluded that the choice of classification algorithm had little importance when comparing the performance of the 17 most accurate classifiers studied.

Even though machine learning has been used to develop models for SDP, Rana et al. [44] identified that the adoption of machine learning for SDP in industry has been limited. The study showed that attributes other than predictive accuracy, such as cost, reliability and generalizability, affect the willingness to adopt the technology. To accelerate the adoption of machine learning in industry, Rana et al. [44] developed a framework for comparing machine learning based techniques to existing systems. Using the framework helps industry make informed decisions and reflect on strengths and areas of improvement with respect to a given technology.
4 Research Design

4.1 Research Questions

The aim of this research is to develop a model for classifying the asset from which a defect originates, using historic defect reports. Furthermore, the model is evaluated using metrics that are appropriate for the given data set and by measuring the potential time-savings at a large telecom company. This leads to the main research question:

RQ 1: How can the source asset of a given defect report be identified using machine learning?

In order to develop a model for classifying the asset from which a defect originates, qualitative features need to be constructed. This leads to the first sub-question:

RQ 1.1: Which attributes of the defect reports can be used for constructing features?

When the available attributes have been determined, the initial set of features is constructed. However, some features might not contain information that describes the functional relationship between the defect report and the asset, which might decrease the performance of the classification model. Therefore, a suitable feature set needs to be determined. This leads to the second sub-question:

RQ 1.2: Which set of features provides the best results using k-fold cross-validation?

This question is addressed by evaluating a number of supervised classification algorithms with different subsets of features. The outcome will show which combination provides the most accurate classifications. The answers to the sub-questions will be used to answer the main research question. In further detail, the answers will provide a classification model for classifying the asset from which a defect originates, which will be evaluated by measuring accuracy, recall, precision, the F1-score and the Matthews Correlation Coefficient.

4.2 Research Methodology

The research conducted followed the design science research methodology seen in Figure 4.1.
The research started by analyzing the defect reports of the organization, which confirmed that, similar to Microsoft [6], the asset attribute was oftentimes assigned incorrect values. This identified the need to aid the defect reporter by recommending a value for the asset attribute.

Figure 4.1: Design science research methodology as explained by Hevner et al. [46]

To gain applicable knowledge for developing a classification model capable of classifying the asset from which a defect originates, a literature review was conducted. The outcome of the literature review was information about classification algorithms to consider, the process of developing and evaluating a model, previously applied techniques for predicting defect attributes and the life cycle of a defect in the organization, which is presented in Sections 2 and 3. Given the organization's data set, the possibilities of using machine learning for predicting the asset attribute were evaluated in order to justify the development of a classification model. To get familiar with the data set and its attributes, data exploration and mining techniques were used. The data exploration showed that the historic defect reports were labeled with the asset from which the defect was removed and could therefore be used to train a classification model. After the previous steps, the development and evaluation of different algorithms, together with different subsets of features, started.

4.2.1 Data set

The data provided by the organization contained defect reports, submitted by both customers and employees, starting from 2010, grouped by two different products. The two products, denoted as Product 1 and Product 2 in this thesis, are mature products built from several million lines of code with a few hundred active developers each. Furthermore, the products are deployed internationally and each have more than 10,000 submitted historic defect reports.
The data set containing the historic defect reports was imported into a table to act as a snapshot. Only attributes that were mandatory to record when creating a defect report were included in the data set to avoid missing data. The mandatory attributes were: Description, Severity, Detection Activity, Artifact and Asset. These attributes were a subset of the attributes presented in Table 2.1. Furthermore, the descriptions of the mandatory attributes with their respective variable types can be seen in Table 4.1.

Attribute | Type | Description
Description | Textual | Description of what is missing, wrong, or unnecessary.
Severity | Ordinal | The highest failure impact that the defect could (or did) cause, as determined by (from the perspective of) the organization responsible for software engineering.
Detection Activity | Nominal | The activity during which the defect was detected (i.e., inspection or testing).
Artifact | Nominal | The specific software work product containing the defect.
Asset | Nominal | The software asset (product, component, module, etc.) containing the defect.

Table 4.1: The mandatory attributes of the defect reports that were provided by the telecom company.

Defect reports which were not addressed or resolved were removed since they could not have been used during training or validation. This was because the asset attribute of the defect reports could be reassigned during the resolution process; the assigned value could therefore be incorrect since it was not final. Furthermore, the defect reports that described failures which were not caused by a defect were removed. Such failures could, for instance, be caused by not following the documentation when configuring the system. Since the reported failure was not caused by a defect, it would not result in a corrective measure in an asset, and the assigned value of the asset attribute could therefore be considered incorrect. A lower bound for the number of possible assets for each product can be seen in Table 4.2.
Product | #Assets
Product 1 | >40
Product 2 | >100

Table 4.2: The number of possible assets for each product.

4.2.2 Development Setup

The development and evaluation were performed on a laptop running Windows 7. The limited performance affected the time required to run the experiments. If, for instance, the experiments had been conducted on a mainframe computer, each iteration of the 10-fold cross-validation could have been run in parallel. This would have been useful when evaluating algorithms with a slow training phase, such as SVMs, as the computer is unusable during training. Furthermore, this limited the possibilities of tuning each algorithm, since each tuning task requires an exhaustive search over the parameter values. The optimal value is decided by comparing the score of each parameter value using cross-validation on the training set. The framework that was used for feature engineering and machine learning tasks is Scikit-learn [47]. Scikit-learn offers a vast library of machine learning algorithms and tools for both feature engineering and model evaluation. The selection of the framework was based on ease of use, community support and familiarity with Python.

4.2.3 Algorithms

The classification algorithms evaluated for constructing a classification model can be seen in Table 4.3. The set of algorithms was decided based on their usage in related research. As stated by the No Free Lunch Theorem [11], there is no single algorithm that works best for every problem, and therefore the most suitable algorithm was determined through the conducted tests. Furthermore, a neural network implementation was excluded due to the long training phase and the large number of parameters that would need to be tuned.

Classification Algorithm | Description
MultinomialNB [48] | Naive Bayes implementation for multinomially distributed data.
DecisionTreeClassifier [49] | A decision tree implementation for classification that uses an optimized version of CART [50].
LogisticRegression [51] | A Logistic Regression implementation for classification.
KNeighborsClassifier [52] | A K-Nearest Neighbors implementation that uses five neighbors as default.
LinearSVC [53] | A SVM for classification that uses a linear kernel with the cost parameter set to 1 as default.

Table 4.3: Descriptions of the evaluated classification algorithms.

4.2.4 Feature engineering

From the mandatory defect attributes seen in Table 4.1, a set of features was constructed. To represent the textual attribute Description as features, the attribute was transformed into a feature vector using TF-IDF. The TF-IDF implementation of Scikit-learn [54] also offers removal of stop words and tokenization of words using regular expressions. This was utilized to remove all the stop words defined by Scikit-learn [55] and to tokenize the words so that each token contained at least one alphabetical character, using the following pattern: (?ui)\b\w*[a-z]+\w*\b. The stop words and numerical sequences were removed since they were irrelevant for describing the attribute. The 1,000 words with the highest TF-IDF scores were selected, since a higher number of words would have increased the training time of the model. However, the number of words selected from a TF-IDF vector is a parameter that needs to be tuned, which is not considered in this thesis. The categorical attributes of the defect reports were transformed into features using Scikit-learn's algorithm for One-Hot encoding [56].

4.2.5 Iteration 1

The first iteration evaluated the classification scheme seen in Figure 4.2. Using this scheme, the reporter would not assign the asset attribute a value. Instead, the trained classification model would decide the asset based on the attributes recorded by the reporter. To decide which attributes the model should use when classifying the asset, three tests were conducted. The three tests were:

1.
Learning with features constructed from the categorical attributes
2. Learning with features constructed from the textual attributes
3. Learning with a combination of the features used for the previous tests

Furthermore, to evaluate which algorithm provided the best results, each test was conducted with the same set of classification algorithms, described in Table 4.3.

Figure 4.2: Classification scheme for Iteration 1

The metrics used for evaluation during the tests were Accuracy, Precision, Recall, F1-Score and the Matthews Correlation Coefficient. These metrics were used to measure how the classifier performs across all classes, which the accuracy measure fails to do in unbalanced data sets. However, while these measures show the significance of the classifier, the organization has determined that accuracy is the most important metric. To validate the measures of each classification model, 10-fold cross-validation with stratified folds, described in Chapter 2, was used. This method of validation has been shown to provide a better estimation of performance than lower values of K [26], while higher values of K come at a higher computational cost and bias. Furthermore, stratified folds were used to reduce the variance between the different cross-validation iterations and to maintain the class distribution upon validation. Three benchmarks were created using three different setups, described in Table 4.4, of Scikit-learn's DummyClassifier. These setups represent three different classifiers that do not learn anything about the underlying relationships between the features and the target variable, but instead classify the asset of a given defect report using simple strategies such as the class distribution of the training data. The benchmark values, seen in Table 4.5 for Product 1 and Table 4.6 for Product 2, were used as lower bounds for evaluating the possibilities of using machine learning to classify the asset from which a defect originates.
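The feature construction (Section 4.2.4) and evaluation setup described above can be sketched with Scikit-learn. The defect reports and asset labels below are hypothetical toy data; the real data set is proprietary, and the thesis uses 10 folds rather than the 3 shown here:

```python
# Illustrative sketch: TF-IDF features for the textual attribute, one-hot
# features for a categorical attribute, a LinearSVC classifier, stratified
# cross-validation, and a DummyClassifier baseline.
from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import LinearSVC
import pandas as pd

reports = pd.DataFrame({
    "Description": ["login fails after timeout", "decoder drops frames",
                    "load balancer rejects requests", "login page crashes",
                    "decoder output corrupted", "balancer misroutes traffic"] * 5,
    "Severity": ["High", "Medium", "High", "Low", "Medium", "High"] * 5,
})
assets = ["Login", "Decoder", "Load Balancer"] * 10  # target: source asset

features = ColumnTransformer([
    # TF-IDF over the textual attribute: stop words removed, tokens must
    # contain at least one alphabetical character, top-1000 terms kept.
    ("text", TfidfVectorizer(stop_words="english",
                             token_pattern=r"(?ui)\b\w*[a-z]+\w*\b",
                             max_features=1000), "Description"),
    # One-hot encoding of the categorical attribute.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Severity"]),
])

model = Pipeline([("features", features), ("clf", LinearSVC())])
baseline = DummyClassifier(strategy="most_frequent")

cv = StratifiedKFold(n_splits=3)  # the thesis uses 10 stratified folds
model_acc = cross_val_score(model, reports, assets, cv=cv).mean()
base_acc = cross_val_score(baseline, reports, assets, cv=cv).mean()
```

A trained model whose cross-validated scores do not exceed the baseline's would, per the criterion above, not be considered useful.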
If the performance of a classification model did not exceed these benchmarks, then it had not gained any knowledge about the functional relationship between the features and the asset and was therefore not considered to be useful by the telecom company.

Setup | Description
stratified | Classifies entries based on the distribution of the label class in the training set.
most_frequent | Classifies entries based on the most frequent label in the training set.
uniform | Classifies entries uniformly at random.

Table 4.4: Descriptions of the different setups of the DummyClassifier used in this thesis.

The benchmark values, presented in Table 4.5 and Table 4.6, show that there is no correlation between the models' classifications and the actual assets, since the MCC is equal to or less than zero. The highest accuracy is achieved by always classifying the asset of a given defect report as the most frequently occurring asset in the training data. The models using this strategy correctly classify the asset of 13.1% of the defect reports for Product 1 and 9.4% of the defect reports for Product 2. However, this results in a macro-averaged precision and recall close to zero, since the models fail to classify any asset other than the most frequent one. Using any other strategy results in a lower accuracy which in most cases is close to zero.

Setup | Accuracy | Precision (Macro) | Recall (Macro) | F1-score (Macro) | MCC
stratified | 8.01% | 2.2% | 2.44% | 2.13% | -0.001
most_frequent | 13.1% | 0.3% | 2.33% | 0.54% | 0.0
uniform | 2.37% | 2.26% | 2.54% | 1.65% | -0.001

Table 4.5: The benchmark values by setup for Product 1.

Setup | Accuracy | Precision (Macro) | Recall (Macro) | F1-score (Macro) | MCC
stratified | 2.85% | 0.8% | 0.81% | 0.78% | -0.002
most_frequent | 9.4% | 0.08% | 0.85% | 0.15% | 0.0
uniform | 0.79% | 0.83% | 0.93% | 0.63% | -0.001

Table 4.6: The benchmark values by setup for Product 2.

4.2.6 Iteration 2

The second iteration evaluated the classification scheme seen in Figure 4.3.
The first iteration did not make use of the reporter's expertise. Instead of deterministically classifying the source asset of a given defect report, the model in the second iteration would provide the reporter with a list of likely assets for the recorded attributes of the defect report. This would leave the final selection of the source asset to the reporter, who would use their expertise to select an asset from the list of recommended assets. Furthermore, the recommendation system could also aid reporters who have little to no expertise, since the list of possible assets would be shortened.

Figure 4.3: Classification scheme for Iteration 2

Similar to the first iteration, three tests were conducted to decide which attributes the model should use when classifying the recommendations. The three tests were:

1. Learning with features constructed from the categorical attributes
2. Learning with features constructed from the textual attributes
3. Learning with a combination of the features used for the previous tests

Furthermore, to evaluate which algorithm provided the best results, each test was conducted with the same set of classification algorithms, described in Table 4.3. As the performance metrics for recommendation systems differ from the metrics used for classification models, different metrics than those used for Iteration 1 were used. The metrics used for evaluation during the tests were Recall@n and Precision@n. Recall@n is the previously presented recall metric extended to recommendation systems and is calculated using Formula 4.1, where n denotes the number of recommended items.

recall@n = |relevant recommended items @n| / |relevant items|   (4.1)

Calculating the Recall@3 for the recommendations seen in Table 4.7 results in a value of approximately 0.667 = 66.7%, since the number of relevant recommended assets is 2 and the number of relevant assets is 3 across all recommendations.
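The Recall@n and Precision@n computations can be sketched as follows, reproducing the worked example of Table 4.7 (the asset names are taken from that example):

```python
# Illustrative sketch of the recall@n and precision@n metrics.
def recall_at_n(recommendations, correct_assets, n):
    """Fraction of relevant assets that appear among the top-n
    recommendations (each defect report has exactly one relevant asset)."""
    hits = sum(correct in recs[:n]
               for recs, correct in zip(recommendations, correct_assets))
    return hits / len(correct_assets)

def precision_at_n(recommendations, correct_assets, n):
    # With one relevant asset per report, precision@n = recall@n / n.
    return recall_at_n(recommendations, correct_assets, n) / n

recs = [["Login", "Validator", "Load Balancer"],
        ["Decoder", "Login", "Load Balancer"],
        ["Router", "Validator", "Decoder"]]
correct = ["Login", "Validator", "Router"]

print(recall_at_n(recs, correct, 3))     # 2/3: two of three correct assets hit
print(precision_at_n(recs, correct, 3))  # 2/9
```

Reports 1 and 3 have their correct asset among the three recommendations while report 2 does not, giving the 2-out-of-3 recall discussed above.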
Id | Recommended Assets | Correct Asset | |Relevant Recommended Assets| | |Relevant Assets|
1 | [Login, Validator, Load Balancer] | Login | 1 | 1
2 | [Decoder, Login, Load Balancer] | Validator | 0 | 1
3 | [Router, Validator, Decoder] | Router | 1 | 1

Table 4.7: Example of recommendations provided by a system that recommends three assets for a given defect report.

The Precision@n metric is the precision metric extended to recommendation systems and is calculated using Formula 4.2.

precision@n = |relevant recommended items @n| / |recommended items|   (4.2)

Since each defect report has only one correct asset, the accuracy metric equals the Recall@n metric. Furthermore, the number of recommended items, used by the Precision@n metric, can be simplified to n × |relevant items|. Therefore the Precision@n measure can be simplified to Formula 4.3.

precision@n = recall@n / n   (4.3)

Using the Recall@3 calculated for the recommendations seen in Table 4.7 results in the Precision@3 seen in Formula 4.4.

precision@3 = recall@3 / 3 ≈ 0.222 = 22.2%   (4.4)

Similar to the first iteration, 10-fold cross-validation with stratified folds was used to validate the measures of each classification model. Furthermore, a benchmark was developed to emulate a recommendation system that uses the class distribution of the training data to provide recommendations. The benchmark values for each product can be seen in Table 4.8.

Product | Recall@1 | Recall@3 | Recall@5 | Recall@10
Product 1 | 13.1% | 36.75% | 52.15% | 78.71%
Product 2 | 9.4% | 20.16% | 26.66% | 40.16%

Table 4.8: The benchmark values of the recommendation system for each product.

5 Results

In this chapter the results of each iteration are presented. Section 5.1 presents the results of Iteration 1, which describes the performance of classification models trained with three different feature sets. Section 5.2 presents the results of Iteration 2, which describes the performance of recommendation models trained with three different feature sets.
Furthermore, in Section 5.2 the feature set with the best performing models was used to create charts which show the balance between precision and recall for each classification model.

5.1 Iteration 1

This section presents the results of the first iteration. The first iteration aimed to evaluate which set of features and classification algorithm provided the best performance when constructing a classification model for classifying the asset from which a defect originates. For each feature set the classification model with the maximum accuracy is bolded. To complement the accuracy metric, macro-averages of Precision, Recall and F1-Score are provided. Furthermore, to show the correlation between the predictions and the actual labels, the MCC metric is provided. Tables 5.1 and 5.2 show the performance of the classification models trained with features constructed from the categorical attributes of defect reports from Product 1 and Product 2 respectively. For Product 1 the maximum accuracy of 26.82% is achieved by using LogisticRegression. For Product 2 the maximum accuracy of 29.56% is achieved by using LinearSVC.

Classifier | Accuracy | Precision (Macro) | Recall (Macro) | F1-score (Macro) | MCC
MultinomialNB | 26.28% | 7.47% | 8.17% | 7.29% | 0.186
DecisionTreeClassifier | 25.93% | 8.62% | 8.36% | 7.88% | 0.183
LogisticRegression | 26.82% | 7.87% | 8.23% | 7.31% | 0.191
KNeighborsClassifier | 21.81% | 8.45% | 7.52% | 7.15% | 0.147
LinearSVC | 26.43% | 7.48% | 8.27% | 7.24% | 0.185

Table 5.1: Results for models trained with features constructed from the categorical attributes of defect reports from Product 1
Classifier | Accuracy | Precision (Macro) | Recall (Macro) | F1-score (Macro) | MCC
MultinomialNB | 28.72% | 11.33% | 13.77% | 10.61% | 0.267
DecisionTreeClassifier | 29.3% | 17.16% | 18.4% | 15.5% | 0.273
LogisticRegression | 29.47% | 14.34% | 16.82% | 13.35% | 0.275
KNeighborsClassifier | 22.76% | 14.66% | 15.84% | 13.62% | 0.205
LinearSVC | 29.56% | 14.95% | 17.98% | 14.33% | 0.276

Table 5.2: Results for models trained with features constructed from the categorical attributes of defect reports from Product 2

Tables 5.3 and 5.4 show the performance of the classification models trained with features constructed from the textual attributes of defect reports from Product 1 and Product 2 respectively. For Product 1 the maximum accuracy of 57.36% is achieved by using LinearSVC. For Product 2 the maximum accuracy of 37.17% is achieved by using LinearSVC.

Classifier | Accuracy | Precision (Macro) | Recall (Macro) | F1-score (Macro) | MCC
MultinomialNB | 47.17% | 15.96% | 13.35% | 12.98% | 0.42
DecisionTreeClassifier | 39.67% | 17.75% | 16.86% | 16.72% | 0.347
LogisticRegression | 55.36% | 25.04% | 18.56% | 19.05% | 0.512
KNeighborsClassifier | 38.5% | 17.99% | 15.65% | 15.57% | 0.331
LinearSVC | 57.36% | 31.61% | 24.72% | 26.02% | 0.536

Table 5.3: Results for models trained with features constructed from the textual attributes of defect reports from Product 1

Classifier | Accuracy | Precision (Macro) | Recall (Macro) | F1-score (Macro) | MCC
MultinomialNB | 27.12% | 11.47% | 8.31% | 7.33% | 0.245
DecisionTreeClassifier | 24.67% | 13.97% | 13.44% | 13.11% | 0.226
LogisticRegression | 35.12% | 17.44% | 13.0% | 12.96% | 0.328
KNeighborsClassifier | 20.77% | 15.75% | 11.32% | 11.22% | 0.188
LinearSVC | 37.17% | 23.41% | 19.88% | 20.07% | 0.352

Table 5.4: Results for models trained with features constructed from the textual attributes of defect reports from Product 2

Tables 5.5 and 5.6 show the performance of the classification models trained with features constructed from both the categorical and textual attributes of defect reports from Product 1 and Product 2 respectively.
For Product 1 the maximum accuracy of 58.52% is achieved by using LinearSVC. For Product 2 the maximum accuracy of 48.64% is achieved by using LinearSVC.

Classifier | Accuracy | Precision (Macro) | Recall (Macro) | F1-score (Macro) | MCC
MultinomialNB | 48.92% | 18.81% | 15.83% | 15.55% | 0.441
DecisionTreeClassifier | 39.91% | 18.57% | 17.75% | 17.72% | 0.35
LogisticRegression | 56.93% | 28.65% | 21.12% | 21.79% | 0.529
KNeighborsClassifier | 34.44% | 16.16% | 14.72% | 14.32% | 0.286
LinearSVC | 58.52% | 33.93% | 27.89% | 28.91% | 0.549

Table 5.5: Results for models trained with features constructed from both the textual and categorical attributes of defect reports from Product 1

Classifier | Accuracy | Precision (Macro) | Recall (Macro) | F1-score (Macro) | MCC
MultinomialNB | 38.35% | 22.55% | 18.42% | 16.8% | 0.366
DecisionTreeClassifier | 32.66% | 23.59% | 23.29% | 22.61% | 0.308
LogisticRegression | 47.75% | 34.1% | 29.26% | 28.68% | 0.462
KNeighborsClassifier | 34.25% | 25.11% | 25.01% | 23.17% | 0.324
LinearSVC | 48.64% | 38.06% | 35.76% | 35.14% | 0.472

Table 5.6: Results for models trained with features constructed from both the textual and categorical attributes of defect reports from Product 2

5.2 Iteration 2

This section presents the results of the second iteration. The second iteration aimed to evaluate which set of features and classification algorithm provided the best performance when constructing a classification model for providing the reporter with recommendations of the asset from which a defect originates. For each feature set the classification model with the maximum Recall@5 is bolded. The metrics Recall@1, Recall@3 and Recall@10 are also provided to show the relationship between Recall and the number of recommendations. Furthermore, the results of the feature set which provided the best performing classifier for each product are complemented with a chart which shows the relationship between Recall and Precision for different numbers of recommendations.
Tables 5.7 and 5.8 show the performance of the classification models trained with features constructed from the categorical attributes of defect reports from Product 1 and Product 2 respectively. For Product 1 the maximum Recall@5 of 62.72% is achieved by using LogisticRegression. For Product 2 the maximum Recall@5 of 62.35% is achieved by using LinearSVC.

Classifier | Recall@1 | Recall@3 | Recall@5 | Recall@10
MultinomialNB | 26.29% | 48.6% | 62.54% | 81.39%
DecisionTreeClassifier | 25.94% | 45.98% | 59.37% | 75.89%
LogisticRegression | 26.83% | 48.6% | 62.72% | 81.66%
KNeighborsClassifier | 21.68% | 37.11% | 44.42% | 48.56%
LinearSVC | 26.36% | 48.54% | 62.65% | 81.72%

Table 5.7: Results for recommendation models trained with features constructed from categorical attributes of defect reports from Product 1

Classifier | Recall@1 | Recall@3 | Recall@5 | Recall@10
MultinomialNB | 28.74% | 48.12% | 60.58% | 79.58%
DecisionTreeClassifier | 29.32% | 48.58% | 59.3% | 75.86%
LogisticRegression | 29.49% | 49.91% | 61.92% | 80.95%
KNeighborsClassifier | 23.02% | 38.11% | 45.03% | 47.69%
LinearSVC | 29.39% | 49.43% | 62.35% | 81.55%

Table 5.8: Results for recommendation models trained with features constructed from categorical attributes of defect reports from Product 2

Tables 5.9 and 5.10 show the performance of the classification models trained with features constructed from the textual attributes of defect reports from Product 1 and Product 2 respectively. For Product 1 the maximum Recall@5 of 86.74% is achieved by using LinearSVC. For Product 2 the maximum Recall@5 of 70.06% is achieved by using LinearSVC.
Classifier | Recall@1 | Recall@3 | Recall@5 | Recall@10
MultinomialNB | 47.16% | 70.08% | 79.28% | 89.23%
DecisionTreeClassifier | 39.86% | 40.59% | 41.08% | 52.43%
LogisticRegression | 55.35% | 76.89% | 84.33% | 92.13%
KNeighborsClassifier | 38.49% | 57.08% | 64.67% | 68.94%
LinearSVC | 56.35% | 78.73% | 86.74% | 94.03%

Table 5.9: Results for recommendation models trained with features constructed from textual attributes of defect reports from Product 1

Classifier | Recall@1 | Recall@3 | Recall@5 | Recall@10
MultinomialNB | 27.12% | 47.51% | 58.01% | 73.04%
DecisionTreeClassifier | 24.83% | 27.16% | 28.43% | 31.38%
LogisticRegression | 35.11% | 56.53% | 66.43% | 79.67%
KNeighborsClassifier | 20.76% | 35.31% | 43.29% | 45.98%
LinearSVC | 37.02% | 60.19% | 70.06% | 82.9%

Table 5.10: Results for recommendation models trained with features constructed from textual attributes of defect reports from Product 2

Tables 5.11 and 5.12 show the performance of the classification models trained with features constructed from both the categorical and textual attributes of defect reports from Product 1 and Product 2 respectively. For Product 1 the maximum Recall@5 of 86.59% is achieved by using LinearSVC. For Product 2 the maximum Recall@5 of 81.9% is achieved by using LinearSVC.
Classifier | Recall@1 | Recall@3 | Recall@5 | Recall@10
MultinomialNB | 48.91% | 70.44% | 79.59% | 89.06%
DecisionTreeClassifier | 40.17% | 40.87% | 41.32% | 52.31%
LogisticRegression | 56.92% | 77.75% | 85.09% | 92.63%
KNeighborsClassifier | 34.43% | 51.7% | 59.86% | 63.59%
LinearSVC | 57.91% | 79.08% | 86.59% | 94.1%

Table 5.11: Results for recommendation models trained with features constructed from both the textual and categorical attributes of defect reports from Product 1

Classifier | Recall@1 | Recall@3 | Recall@5 | Recall@10
MultinomialNB | 38.36% | 59.19% | 70.21% | 84.65%
DecisionTreeClassifier | 32.67% | 34.95% | 35.99% | 38.45%
LogisticRegression | 47.76% | 70.54% | 80.26% | 91.15%
KNeighborsClassifier | 34.27% | 52.71% | 60.04% | 61.84%
LinearSVC | 48.8% | 71.75% | 81.9% | 92.34%

Table 5.12: Results for recommendation models trained with features constructed from both the textual and categorical attributes of defect reports from Product 2

Figures 5.1 and 5.2 show the relationship between Recall and Precision for different numbers of recommendations of the classification models trained with features constructed from both the categorical and textual attributes of defect reports from Product 1 and Product 2 respectively. Furthermore, the figures include the values of the baseline model as a point of reference.

Figure 5.1: Resulting (a) recall@n and (b) precision@n for the recommendation model trained with features constructed from both textual and categorical attributes of defect reports from Product 1

Figure 5.2: Resulting (a) recall@n and (b) precision@n for the recommendation model trained with features constructed from both textual and categorical attributes of defect reports from Product 2

6 Discussion

In this chapter the results of each iteration are discussed. Furthermore, possible future work and the threats to validity are discussed.
6.1 Iteration 1

The first iteration evaluated which set of features and classification algorithm provided the best performance when constructing a classification model for classifying the asset from which a defect originates. All classification models that were trained during the first iteration achieved higher scores on all measures than the benchmark values seen in Tables 6.1 and 6.2.

Setup | Accuracy | Precision (Macro) | Recall (Macro) | F1-score (Macro) | MCC
stratified | 8.01% | 2.2% | 2.44% | 2.13% | -0.001
most_frequent | 13.1% | 0.3% | 2.33% | 0.54% | 0.0
uniform | 2.37% | 2.26% | 2.54% | 1.65% | -0.001

Table 6.1: The benchmark values by setup for Product 1.

Setup | Accuracy | Precision (Macro) | Recall (Macro) | F1-score (Macro) | MCC
stratified | 2.85% | 0.8% | 0.81% | 0.78% | -0.002
most_frequent | 9.4% | 0.08% | 0.85% | 0.15% | 0.0
uniform | 0.79% | 0.83% | 0.93% | 0.63% | -0.001

Table 6.2: The benchmark values by setup for Product 2.

Across all feature sets used to train classification models, the models using LogisticRegression and LinearSVC provided the highest accuracies. However, the choice of algorithm did not make a significant difference on the F1-Score and MCC when only using features constructed from the categorical attributes. Comparing the measures of the best performing models using either features constructed from the textual attributes or the categorical attributes shows that the textual attributes provide more information for classifying the asset. The best performing classification model for each product was trained with features constructed from both the categorical and textual attributes using LinearSVC, as seen in Table 6.3.

Product | Classifier | Accuracy | Precision (Macro) | Recall (Macro) | F1-score (Macro) | MCC
Product 1 | LinearSVC | 58.52% | 33.93% | 27.89% | 28.91% | 0.549
Product 2 | LinearSVC | 48.64% | 38.06% | 35.76% | 35.14% | 0.472

Table 6.3: The best performing classification model for each product.
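The MCC values in these tables range from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction). A minimal sketch with hypothetical asset labels, using scikit-learn:

```python
# Minimal sketch (hypothetical labels): MCC for perfect predictions and
# for a constant prediction like the most_frequent baseline.
from sklearn.metrics import matthews_corrcoef

y_true = ["Login", "Decoder", "Router", "Login", "Decoder", "Router"]
y_perfect = list(y_true)
y_constant = ["Login"] * 6  # always predicts the most frequent-style guess

print(matthews_corrcoef(y_true, y_perfect))   # 1.0
print(matthews_corrcoef(y_true, y_constant))  # 0.0
```

This is why the most_frequent benchmark models score an MCC of 0.0 despite their non-zero accuracy.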
The best performing classification model developed with the defect reports of Product 1 achieves an accuracy of 58.52%. However, the model only achieves a macro-averaged precision of 33.93% and a macro-averaged recall of 27.89%. This shows that, when biasing the measures towards the least populated classes, the probability of the classifier correctly labeling an unseen defect report is 33.93% and the probability of correctly labeling an instance of a given class is 27.89%. The classifier achieves a MCC of 0.549, which indicates a positive correlation between the predictions and the correct labels. This implies that the classifier is making significant predictions that are not determined randomly. Similar observations can be made from the results of the best performing classification model developed with the defect reports of Product 2. However, that model has a lower accuracy of 48.64% but a higher precision and recall of 38.06% and 35.76% respectively.

6.2 Iteration 2

The second iteration evaluated which set of features and classification algorithm provided the best performance when constructing a classification model for providing the reporter with recommendations of the asset from which a defect originates. For both products, the feature set that provided the best classification model was constructed from both the categorical and textual attributes of the defect reports. The benchmark values for evaluating the developed classification models can be seen in Table 6.4. Comparing the classification models using features constructed from both the categorical and textual attributes shows that the only models performing better than the benchmarks, for both products, were those using LinearSVC, LogisticRegression or MultinomialNB.

Product | Recall@1 | Recall@3 | Recall@5 | Recall@10
Product 1 | 13.1% | 36.75% | 52.15% | 78.71%
Product 2 | 9.4% | 20.16% | 26.66% | 40.16%

Table 6.4: The benchmark values of the recommendation system for each product.
The best performing classification model for providing the reporter with a list of recommended assets, for both products, was constructed using LinearSVC. The Recall@n and Precision@n measures of this model can be seen in Table 6.5 and Table 6.6 respectively. For both products, the Recall@n increases while the Precision@n decreases as the number of recommendations increases.

Product | Classifier | Recall@1 | Recall@3 | Recall@5 | Recall@10
Product 1 | LinearSVC | 57.91% | 79.08% | 86.59% | 94.1%
Product 2 | LinearSVC | 48.8% | 71.75% | 81.9% | 92.34%

Table 6.5: The Recall@n of the best performing classifier of each product.

Product | Classifier | Precision@1 | Precision@3 | Precision@5 | Precision@10
Product 1 | LinearSVC | 57.91% | 26.36% | 17.32% | 9.41%
Product 2 | LinearSVC | 48.8% | 23.92% | 16.38% | 9.23%

Table 6.6: The Precision@n of the best performing classifier of each product.

To compare the performance of the developed recommendation models with the currently provided list of assets, the Recall@n and Precision@n are aggregated into the F1-Score, which is the harmonic mean of the two measures. Since the balance between recall and precision is something that needs to be studied, the recall and precision are weighted equally. The F1-Scores for the classification models can be seen in Table 6.7. Furthermore, the Recall@n, Precision@n and F1@n measures for the currently provided lists of assets can be seen in Table 6.8.

Product | Classifier | F1@1 | F1@3 | F1@5 | F1@10
Product 1 | LinearSVC | 57.91% | 39.54% | 28.87% | 17.11%
Product 2 | LinearSVC | 48.8% | 35.88% | 27.3% | 16.78%

Table 6.7: The F1@n of the best performing classifier of each product.

Comparing the F1-Score for the currently provided lists of assets and the best performing classification models shows that the recommendation systems perform better regardless of the number of recommendations when weighting Recall@n and Precision@n equally.
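The F1@n aggregation can be sketched as follows. Since precision@n = recall@n / n for this data, F1@n reduces algebraically to 2 · recall@n / (n + 1); the example below plugs in Recall@n values from Table 6.5 for Product 1:

```python
# F1@n as the harmonic mean of recall@n and precision@n. Because each
# defect report has exactly one correct asset, precision@n = recall@n / n,
# so F1@n simplifies to 2 * recall@n / (n + 1).
def f1_at_n(recall_at_n, n):
    precision_at_n = recall_at_n / n
    return 2 * recall_at_n * precision_at_n / (recall_at_n + precision_at_n)

# Product 1, LinearSVC (Recall@3 = 79.08%, Recall@10 = 94.1%):
print(round(f1_at_n(0.7908, 3), 4))   # 0.3954, matching F1@3 in Table 6.7
print(round(f1_at_n(0.9410, 10), 4))  # 0.1711, matching F1@10 in Table 6.7
```

Small discrepancies against the tabulated values can arise because the Recall@n values themselves are rounded to two decimals.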
Product | n | Recall@n | Precision@n | F1@n
Product 1 | > 40 | 100% | < 2.5% | < 4.88%
Product 2 | > 100 | 100% | < 1% | < 1.98%

Table 6.8: The Recall@n, Precision@n and F1@n measures of the currently provided lists of assets for each product.

For instance, by providing a list of 10 recommendations, the Recall decreased by 5.9 percentage points for Product 1 and 7.66 percentage points for Product 2, while the Precision increased by more than 276% for Product 1 and 823% for Product 2.

6.3 Future Work

To further the research on classifying the asset from which a defect originates using machine learning, there are a few approaches that can be taken.

6.3.1 Feature Engineering

One possible approach is to focus on the features used for classification. In this thesis, two sets of features were constructed and evaluated. However, the degree to which the classification model learns from each feature can be evaluated. For instance, the severity attribute might not correlate with the asset from which the defect originates and could therefore be excluded. This would reduce the number of features, which would reduce the classification and training time and could also increase the quality of the classifications. Other than reducing the number of studied features, the existing features can be tuned. For instance, the number of words selected from the feature vector constructed by the TF-IDF algorithm was 1,000. The number of selected words can be tuned by performing cross-validation on the training set with different values of the parameter. An increase in the number of selected words might increase the performance of the classification model, since significant words that distinguish each defect report might have been excluded when only selecting 1,000 words.

6.3.2 Tuni