Improving Defect Localization by Classifying the Affected Asset using Machine Learning

Master's thesis in Software Engineering

SAM HALALI

© SAM HALALI, 2018.

Supervisor: Miroslaw Staron, Department of Computer Science and Engineering
Examiner: Jan-Philipp Steghöfer, Department of Computer Science and Engineering

Master's Thesis 2018
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Typeset in LaTeX
Gothenburg, Sweden 2018

Abstract

Today's market demands complex large-scale software to be developed and delivered at an increased pace. The increase in software complexity increases the cost of maintenance, which on average accounts for 60 percent of software costs. Corrective maintenance accounts for 21 percent of the maintenance costs and includes receiving a defect report describing a defect, then diagnosing and removing the described defect. A vital part of a defect's resolution is the task of defect localization: finding the exact location of the defect in the system. The defect report, in particular the asset attribute, helps the assigned entity limit the search space when investigating the exact location of the defect.
However, research has shown that oftentimes reporters initially assign values to these attributes that provide incorrect information. In this thesis, using machine learning to classify the source asset for a given defect report at a telecom company was evaluated. Following design science research, two iterations were conducted. The first iteration evaluated classification models for classifying the source asset after submission of a defect report. By training an SVM with features constructed from both categorical and textual attributes of the defect reports, an accuracy of 58.52% was achieved. The second iteration evaluated classification models for providing the reporter with recommendations of likely assets. By using recommendations provided by an SVM trained with features from both categorical and textual attributes of the defect reports, the precision could be significantly increased.

Keywords: Machine Learning, Defect Localization, Defect Predictions, Supervised Learning, Text Classification, Recommendation Systems

Acknowledgements

I would like to thank Miroslaw Staron, my supervisor, for providing valuable feedback and suggestions throughout the thesis process. I would also like to thank the telecom company at which this thesis was conducted. Finally, I would like to thank a long list of people providing me with motivation, help and joy. Thank you: Micael Caiman, Wilhelm Meding, Per Sundvall, Mahmoud Halali, Maria Ahmadikhatir, Emma Ahlberg, Fredrik Rahn, Simon Kindström, Marko Solunac, Patrik Olsson, Michaela Fritiofsson and Jenin Grill.

Sam Halali, Gothenburg, June 2018

Contents

List of Figures
List of Tables

1 Introduction
  1.1 Problem Background
  1.2 Problem Statement
  1.3 Purpose of the Study
  1.4 Limitations and Delimitations
2 Background
  2.1 Defects
  2.2 Machine Learning
    2.2.1 Naive Bayes
    2.2.2 Logistic Regression
    2.2.3 Decision Trees
    2.2.4 Support Vector Machines
    2.2.5 Artificial Neural Networks
    2.2.6 K-Nearest Neighbors
    2.2.7 Performance Evaluation of Classification Models
    2.2.8 Pitfalls in Supervised Learning
      2.2.8.1 Performance Validation of Classification Models
      2.2.8.2 Missing Attribute Values
      2.2.8.3 Class Imbalance Problem
      2.2.8.4 Overfitting
      2.2.8.5 Feature Engineering

3 Related Work
  3.1 Defect Assignment
  3.2 Duplicate Defect Report Detection
  3.3 Text Classification
  3.4 Defect Prediction In Industry

4 Research Design
  4.1 Research Questions
  4.2 Research Methodology
    4.2.1 Data set
    4.2.2 Development Setup
    4.2.3 Algorithms
    4.2.4 Feature engineering
    4.2.5 Iteration 1
    4.2.6 Iteration 2

5 Results
  5.1 Iteration 1
  5.2 Iteration 2

6 Discussion
  6.1 Iteration 1
  6.2 Iteration 2
  6.3 Future Work
    6.3.1 Feature Engineering
    6.3.2 Tuning Classification Algorithms
  6.4 Threats to Validity
    6.4.1 Conclusion Validity
    6.4.2 Internal Validity
    6.4.3 Construct Validity
    6.4.4 External Validity

7 Conclusion

List of Figures

2.1 The life cycle of a defect © 1993 IEEE
2.2 The relationship between defects and other entities in maintenance © 1993 IEEE
2.3 Example view of when reporting a defect.
2.4 The life cycle of a defect report at the telecom company
2.5 A plot of the function seen in Formula 2.11 and the data points seen in Table 2.4.
2.6 Example of a decision tree
2.7 Example of possible hyperplanes for the linearly separable data seen in Table 2.6.
2.8 An artificial neuron
2.9 Example of a complex overfitted model.

4.1 Design science research methodology as explained by Hevner et al. [46]
4.2 Classification scheme for Iteration 1
4.3 Classification scheme for Iteration 2

5.1 Resulting (a) recall@n and (b) precision@n for recommendation model trained with features constructed from both textual and categorical attributes of defect reports from Product 1
5.2 Resulting (a) recall@n and (b) precision@n for recommendation model trained with features constructed from both textual and categorical attributes of defect reports from Product 2

List of Tables

2.1 The attributes of a defect © 1993 IEEE
2.2 Description of the different variable types
2.3 Example of categorical defect report data with an unseen entry used for a Naive Bayes model.
2.4 Example of server metric data which are labeled with whether the server is overloaded or not.
2.5 Example data set used for training a decision tree model
2.6 Example of server metric data which are labeled with whether the server is overloaded or not.
2.7 Categorical defect report data with unseen entry x
2.8 The categorical defect report seen in Table 2.7 that has been encoded using the one-hot method.
2.9 Example of nominal defect data
2.10 Example of features constructed from nominal defect data from Table 2.9 using one-hot encoding
2.11 Example of human-written defect description.
2.12 Example of features constructed from human-written text using one-hot encoding
2.13 Example of features constructed from human-written text using a simple bag of words model

4.1 The mandatory attributes of the defect reports that were provided by the telecom company.
4.2 The number of possible assets for each product.
4.3 Descriptions of the evaluated classification algorithms.
4.4 Descriptions of the different setups of the DummyClassifier used in this thesis.
4.5 The benchmark values by setup for Product 1.
4.6 The benchmark values by setup for Product 2.
4.7 Example of recommendations provided by a system that recommends three assets for a given defect report.
4.8 The benchmark values of the recommendation system for each product.

5.1 Results for models trained with features constructed from the categorical attributes of defect reports from Product 1
5.2 Results for models trained with features constructed from the categorical attributes of defect reports from Product 2
5.3 Results for models trained with features constructed from the textual attributes of defect reports from Product 1
5.4 Results for models trained with features constructed from the textual attributes of defect reports from Product 2
5.5 Results for models trained with features constructed from both the textual and categorical attributes of defect reports from Product 1
5.6 Results for models trained with features constructed from both the textual and categorical attributes of defect reports from Product 2
5.7 Results for recommendation model trained with features constructed from categorical attributes of defect reports from Product 1
5.8 Results for recommendation model trained with features constructed from categorical attributes of defect reports from Product 2
5.9 Results for recommendation model trained with features constructed from textual attributes of defect reports from Product 1
5.10 Results for recommendation model trained with features constructed from textual attributes of defect reports from Product 2
5.11 Results for recommendation model trained with features constructed from both the textual and categorical attributes of defect reports from Product 1
5.12 Results for recommendation model trained with features constructed from both the textual and categorical attributes of defect reports from Product 2

6.1 The benchmark values by setup for Product 1.
6.2 The benchmark values by setup for Product 2.
6.3 The best performing classification model for each product.
6.4 The benchmark values of the recommendation system for each product.
6.5 The Recall@n of the best performing classifier of each product.
6.6 The Precision@n of the best performing classifier of each product.
6.7 The F1@n of the best performing classifier of each product.
6.8 The benchmark values of the recommendation system for each product.

1 Introduction

1.1 Problem Background

Today's market demands complex large-scale software to be developed and delivered at an increased pace.
The increase in software complexity increases the cost of maintenance [1], which on average accounts for 60 percent of software costs [2]. Corrective maintenance accounts for 21 percent of the maintenance costs and includes receiving a defect report, then diagnosing and removing the defect. In most cases, when a defect is found it is initially documented, assigned to a team and then submitted into a Defect Tracking System (DTS).

A defect report consists of several attributes that are vital during the life cycle of a defect. Among these attributes are the textual description of the defect, the asset attribute, which indicates the system part containing the defect, and the severity attribute, which indicates the highest impact the defect could or did cause. These attributes are utilized both when assigning the task of removing the defect to a team and when the assigned team investigates the location of the defect and the corrective measures for removing it.

A vital part of a defect's resolution is the task of defect localization: finding the exact location of the defect in the system. This task relies on the developer's expertise of the system and ability to identify and prioritize, during investigation, assets which may contain the defect [3]. The defect report, in particular the asset attribute, should help the assigned entity limit the search space when investigating the exact location of the defect. However, research has shown that oftentimes reporters initially assign values to these attributes that provide incorrect information [4][5].

A study conducted at Microsoft by Guo et al. [6] showed that oftentimes the reporter of the defect does not have the required expertise to identify the asset containing the defect, which is one of the main reasons behind the attribute being assigned an incorrect value.
The issue of incorrect values is also reflected in large-scale organizations other than Microsoft with many mission-critical parts, where a delay can have a big impact on cost and customer perception. Furthermore, in organizations that have dedicated development teams responsible for specific assets, the defect assignment process is intertwined with the task of defect localization [7]. In such organizations, an incorrect asset value may also delay the defect assignment.

1.2 Problem Statement

The problem addressed in this thesis is that the asset attribute of a defect report is often assigned an incorrect value. In large software organizations, it is common that the reporter of the defect does not have the required expertise to identify the asset containing the defect. As a result, the asset attribute is assigned an incorrect value, which unnecessarily delays defect resolution and has a large impact on maintenance costs, the speed of software development and the quality of the final product.

1.3 Purpose of the Study

The purpose of this thesis is to improve defect localization by classifying the correct asset from which a defect originates using machine learning. The intention is to build a classification model, based on the results of previous research, in an industrial context. In further detail, the aim is to reduce the defect localization time by providing a more accurate recommendation of which asset contains the defect. This will aid the assigned entity in limiting the search space when investigating the exact location of the defect.

1.4 Limitations and Delimitations

Like every study, this study is subject to limitations. One limitation is that defect reports of a telecom company will be used. The attributes of the defect report that will be considered are a subset of those specified by IEEE 1044 [8] that are recorded upon initial documentation of a defect.
These attributes include Description, Artifact, Severity and Detection Activity, and will be described in further detail in Chapter 2. The study will therefore only be reproducible given that the previously mentioned attributes exist in the defect reports used.

One delimitation is that only data from defect reports will be used. Other research on the topic has shown that data from other sources, such as version control systems, can be utilized for prediction [9]. These sources could contain useful features for classifying the asset, such as affected files and commit messages, but will not be considered in this study due to its time constraints.

2 Background

2.1 Defects

According to [8] a defect is: "An imperfection or deficiency in a work product where that work product does not meet its requirements or specifications and needs to be either repaired or replaced." When a defect is detected, its attributes are classified and documented in a defect report. The attributes of a defect can be seen in Table 2.1, where the names of the attributes might differ depending on the organization. Furthermore, the values of some attributes are added and changed over time as the organization addresses the defect.

Defect ID: Unique identifier for the defect.
Description: Description of what is missing, wrong, or unnecessary.
Status: Current state within the defect report life cycle.
Asset: The software asset (product, component, module, etc.) containing the defect.
Artifact: The specific software work product containing the defect.
Version detected: Identification of the software version in which the defect was detected.
Version corrected: Identification of the software version in which the defect was corrected.
Priority: Ranking for processing assigned by the organization responsible for the evaluation, resolution, and closure of the defect relative to other reported defects.
Severity: The highest failure impact that the defect could (or did) cause, as determined by (from the perspective of) the organization responsible for software engineering.
Probability: Probability of recurring failure caused by this defect.
Effect: The class of requirement that is impacted by a failure caused by a defect.
Type: A categorization based on the class of code within which the defect is found or the work product within which the defect is found.
Mode: A categorization based on whether the defect is due to incorrect implementation or representation, the addition of something that is not needed, or an omission.
Insertion activity: The activity during which the defect was injected/inserted (i.e., during which the artifact containing the defect originated).
Detection activity: The activity during which the defect was detected (i.e., inspection or testing).
Failure reference(s): Identifier of the failure(s) caused by the defect.
Change reference: Identifier of the corrective change request initiated to correct the defect.
Disposition: Final disposition of the defect report upon closure.

Table 2.1: The attributes of a defect © 1993 IEEE

The recorded attributes of a defect serve both the reporter and the receiver, who is responsible for removing the defect. Therefore, defect reports are usually stored in a database known as a Defect Tracking System (DTS). The DTS is used by employees of the organization, both engineers and managers, to understand the defect and follow its resolution process. When a defect has been documented and classified, it is assigned to an entity, for instance a development team, that is capable of removing the defect. Once corrective measures have been taken to resolve the defect, the changes are added to a planned release from which the defect is removed. The defect life cycle is depicted in Figure 2.1 and the relationship between defects and several conceptual entities is shown in Figure 2.2.
Figure 2.1: The life cycle of a defect © 1993 IEEE

Figure 2.2: The relationship between defects and other entities in maintenance © 1993 IEEE

At the telecom company where this study was conducted, defects are detected during several stages of the life cycle of its products. Defects can, for instance, be detected by end-users or during internal testing of the product. When a defect is detected by an end-user, the unit for customer support is contacted. Customer support investigates the reported problem and determines whether the problem occurred due to a defect in the product. If so, a defect report is created and submitted in the company's DTS. Regardless of who the reporter is, a web-portal is used to record the attributes of the defect and to describe it. The web-portal looks similar to the one shown in Figure 2.3. The attributes shown in Figure 2.3 are those that are mandatory to record before submitting the defect report: Artifact, Detection Activity, Severity, Asset and Description.

Figure 2.3: Example view of when reporting a defect.

The life cycle of a submitted defect report at the telecom company can be seen in Figure 2.4. When the defect report has been submitted, it is assigned to a team capable of removing the defect. The assignment process varies depending on what type of defect it is. Sometimes there are units responsible for certain assets of the product. These units receive all submitted defect reports regarding the asset they are responsible for and decide which team will be responsible for investigating and removing the defect.

Figure 2.4: The life cycle of a defect report at the telecom company

Once the defect report has been assigned, it is investigated. The ideal outcome of the investigation is that a corrective measure is suggested which leads to the removal of the defect. However, this is not always the case.
Oftentimes, the attributes of the defect are assigned incorrect values. If the incorrect values are detected by the assigned team, the team will suggest changes to the defect report, which can lead to a reassignment of the defect report.

2.2 Machine Learning

The goal of Machine Learning is to develop computational functions that learn through accumulated experience [10]. The two most common methods of learning are Unsupervised Learning and Supervised Learning.

In unsupervised learning, the goal is to learn structures of data sets without an identified output [11]. For instance, given a data set containing 100 defect reports originating from 5 different assets, the data needs to be grouped based on the assets without any explicit indicator of which asset each defect originates from. In the absence of labels, which are explicit indicators of what group each data point belongs to, the groups are learned by maximizing the similarities of the entries within a group and minimizing the similarities between the groups. One of the reasons behind the growth of this learning method is that, in certain problem domains, large unlabeled data sets are available but labeling each entry would be time-consuming and require domain expertise [12]. For instance, when developing a recommendation system for a news platform, a large number of articles would be available, but manually labeling each article as economy, sports or politics would be impractical.

The goal of supervised learning is to train a model to predict one or more outputs for unseen inputs, given that there is a functional relationship between the inputs and the outputs [13]. This is achieved by training the model with a labeled data set, which is a data set that contains the correct outputs for given inputs. According to Louridas et al. [14], this can be compared to giving a student a set of problems with their respective solutions for learning how to solve future unseen problems.
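The idea of training on labeled examples and then predicting labels for inputs can be sketched with a toy model that simply memorizes label counts per input. The data and attribute names below are illustrative only, not taken from the company's defect reports:

```python
from collections import Counter

# Toy labeled data set: (severity, detection activity) -> asset.
training_data = [
    (("A", "Customer"), "Login"),
    (("B", "Customer"), "Login"),
    (("A", "System Test"), "Validator"),
    (("B", "System Test"), "Validator"),
]

def train(data):
    """'Training' here just memorizes label counts per exact input."""
    model = {}
    for features, label in data:
        model.setdefault(features, Counter())[label] += 1
    return model

def predict(model, features):
    """Predict the most common label seen for these exact features."""
    counts = model.get(features)
    return counts.most_common(1)[0][0] if counts else None

model = train(training_data)
print(predict(model, ("A", "Customer")))  # -> Login
```

A real learner generalizes beyond exact matches, which is what the algorithms in the following sections do.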
The output variables to be predicted can be either categorical or numerical [13]. Categorical variables can be nominal, ordinal or dichotomous. Numerical variables can be discrete or continuous, and take on values obtained through measuring or counting. Descriptions of the different variable types can be seen in Table 2.2.

Nominal: A categorical variable that has more than two possible values without a natural order. Example: Asset = {UI, Login, Validator}
Ordinal: A categorical variable whose values have a natural order. Example: Severity = {A, B, C}
Dichotomous: Like a nominal variable, but with only two possible values. Example: State = {Open, Closed}
Discrete: A numerical variable that can take a value from a set of whole numbers. Example: #Lines Affected = {1, 2, 3, ...}
Continuous: A numerical variable that can take any value in a range of real numbers. Example: Time = {2018-04-06 04:01, ..., 2018-04-06 04:01:02, ...}

Table 2.2: Description of the different variable types

The task of predicting an output variable varies depending on the type of the variable. If the output is a numerical variable, as when trying to predict the number of defects reported per week, the problem is a regression problem. If the output is a categorical variable, as when trying to determine the severity of a defect, the problem is a classification problem. As the intention of this thesis is to build a model for determining the asset from which a defect originates, which is a nominal variable, the presented algorithms will be discussed in the context of classification. Furthermore, since all the provided defect data is labeled, only supervised learning algorithms will be presented.

2.2.1 Naive Bayes

Naive Bayes is a probabilistic classification algorithm that assigns an input the label that is most likely for that specific input [15].
The algorithm is based on Bayes' theorem, which can be seen in Equation 2.1, where P(A|B) is read as the probability of A given that event B has occurred.

P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)} = \frac{P(A \cap B)}{P(B)}    (2.1)

When classifying the unseen entry seen in Table 2.3, Bayes' theorem would be used as seen in Equation 2.2, where y is the classified label. The conditional probability P(Login | A ∩ Customer) would be calculated as seen in Equation 2.3.

Severity   Detection Activity   Asset
A          Customer             -
A          Customer             Login
A          System Test          Validator

Table 2.3: Example of categorical defect report data with an unseen entry used for a Naive Bayes model.

y = \arg\max \{ P(\text{Login} \mid A \cap \text{Customer}),\; P(\text{Validator} \mid A \cap \text{Customer}) \}    (2.2)

P(\text{Login} \mid A \cap \text{Customer}) = \frac{P(A \cap \text{Customer} \mid \text{Login}) P(\text{Login})}{P(A \cap \text{Customer})}    (2.3)

Since Naive Bayes makes the naive assumption that all inputs are independent, the calculation of the conditional probability is simplified as seen in Equation 2.4. Furthermore, the denominator is excluded from the calculation, since it is constant when comparing conditional probabilities for the same inputs.

P(\text{Login} \mid A \cap \text{Customer}) = \frac{P(A \mid \text{Login}) P(\text{Customer} \mid \text{Login}) P(\text{Login})}{P(A) P(\text{Customer})}    (2.4)

Now the classification task can be simplified as seen in Equation 2.5, where X is the vector of inputs \{x_1, \dots, x_n\} and C_k is the target variable with the possible values \{C_1, \dots, C_p\}.

y = \arg\max_{k \in \{1, \dots, p\}} P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)    (2.5)

A Naive Bayes model trained with the labeled data seen in Table 2.3 would classify the unseen entry as Login, because the entry has a greater likelihood of belonging to class Login, as seen in Equations 2.6 and 2.7.

P(\text{Login}) P(A \mid \text{Login}) P(\text{Customer} \mid \text{Login}) = 0.5 \cdot 1 \cdot 1 = 0.5    (2.6)

P(\text{Validator}) P(A \mid \text{Validator}) P(\text{Customer} \mid \text{Validator}) = 0.5 \cdot 1 \cdot 0 = 0    (2.7)

Naive Bayes offers fast training, implementation and classification [16][15].
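The computation in Equations 2.6 and 2.7 can be reproduced with a short sketch. It uses the two labeled rows of Table 2.3 (attribute names are illustrative) and applies no smoothing, so unseen attribute values yield a zero probability:

```python
# The two labeled rows of Table 2.3.
labeled = [
    ({"severity": "A", "activity": "Customer"}, "Login"),
    ({"severity": "A", "activity": "System Test"}, "Validator"),
]

def naive_bayes_score(x, label):
    """P(label) * product of per-attribute likelihoods (Equation 2.5)."""
    rows = [f for f, y in labeled if y == label]
    prior = len(rows) / len(labeled)
    likelihood = 1.0
    for attr, value in x.items():
        likelihood *= sum(f[attr] == value for f in rows) / len(rows)
    return prior * likelihood

x = {"severity": "A", "activity": "Customer"}   # the unseen entry
scores = {y: naive_bayes_score(x, y) for y in ("Login", "Validator")}
print(scores)                        # {'Login': 0.5, 'Validator': 0.0}
print(max(scores, key=scores.get))   # Login
```

In practice a smoothing term (e.g. Laplace smoothing) is added so that a single unseen value does not zero out an entire class.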
However, it usually has lower accuracy on interdependent data sets, compared to other classification algorithms, due to the assumption of independence amongst the features. On the other hand, the model offers interpretable insights into both the decision-making during classification and the knowledge gained during training.

2.2.2 Logistic Regression

Logistic regression is a classification method borrowed from the field of statistics. The method combines the logistic sigmoid function seen in Formula 2.8 with a linear model, classifying entries using the model seen in Formula 2.9 [17].

f(x) = \frac{1}{1 + e^{-x}}    (2.8)

P(x) = \frac{1}{1 + e^{-(\beta_0 x + \beta_1)}}    (2.9)

When predicting a binary target variable, the output of the function can be seen as P(x) = P(Y = A \mid X), where X is labeled A if P(X) > 0.5. During training, the slope \beta_0 and intercept \beta_1 seen in Formula 2.9 are decided by maximizing the likelihood function seen in Formula 2.10.

\ell(\beta_0, \beta_1) = \prod_{i: y_i = 1} p(x_i) \prod_{i': y_{i'} = 0} (1 - p(x_{i'}))    (2.10)

When using categorical variables, each unique value adds another dimension. Therefore, to provide an intuitive example, a different example than the one used for describing Naive Bayes is provided. Given the data seen in Table 2.4, the logistic regression model would learn the intercept and slope of the sigmoid function seen in Formula 2.11. A plot of the function together with the data used for training can be seen in Figure 2.5. When classifying the unseen entry x seen in Table 2.4, the model would label the entry, using the learned function, as Overload = NO, since P(2.7) ≈ 0.19 < 0.5.

P(x) = \frac{1}{1 + e^{-(2.8329x - 9.1050)}}    (2.11)

ID   Avg Sent (Mbit/s)   Overload
x    2.7                 -
1    0.5                 NO
2    1.5                 NO
3    2                   NO
4    4.5                 YES
5    5                   YES
6    3.5                 YES
7    3                   NO
8    3                   YES
9    3.5                 NO
10   4                   YES

Table 2.4: Example of server metric data which are labeled with whether the server is overloaded or not.

Figure 2.5: A plot of the function seen in Formula 2.11 and the data points seen in Table 2.4.
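Evaluating the fitted model from Formula 2.11 can be sketched as follows; the coefficients are those stated in the text, and the function names are placeholders:

```python
import math

def sigmoid(z):
    """Logistic sigmoid, Formula 2.8."""
    return 1.0 / (1.0 + math.exp(-z))

def p_overload(avg_sent_mbits):
    """Probability of Overload = YES per the fitted model (Formula 2.11)."""
    return sigmoid(2.8329 * avg_sent_mbits - 9.1050)

p = p_overload(2.7)
print(round(p, 2))                    # ~0.19
print("YES" if p > 0.5 else "NO")     # NO, matching the entry x example
```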
Similar to Naive Bayes, Logistic Regression provides interpretable classifications, since the output of the sigmoid function can be interpreted as a probability of class membership. Furthermore, Logistic Regression offers fast classification and makes no assumption of independence amongst the features. However, the algorithm is limited to linearly separable data, since the decision boundary is linear.

2.2.3 Decision Trees

Decision trees are a classification algorithm that uses a tree of logical decisions based on the attributes of the data to predict a label [15]. Much like in a flowchart, the data to be classified starts at the root node and traverses down the tree based on its attributes. At each branch of the tree there is a decision node, which indicates a decision to be made based on an attribute. The leaf node, also known as the terminal node, indicates the classified label of the data.

The data set shown in Table 2.5 consists of 4 entries containing categorical defect report data. Assume this data was used to train a decision tree model with the field Asset as the target variable. The resulting decision tree would look like Figure 2.6.

ID   Severity   Detection Activity   Asset
1    A          Customer             Login
2    B          System Test          Validator
3    B          Customer             Login
4    A          System Test          Load Balancer

Table 2.5: Example data set used for training a decision tree model

Figure 2.6: Example of a decision tree

One of the main advantages of using decision trees is the interpretability of the classifications [18]. Assume the decision tree is used to classify the entry with ID = 3; the model would classify the asset as Login. If someone were to ask why the asset is classified as such, the answer would be "because the Detection Activity is Customer". This type of insight into the classification is not common amongst classification algorithms and is viewed as one of decision trees' strong suits.
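A tree consistent with the Table 2.5 data can be hand-coded as nested conditions. This is a sketch of one plausible tree (a learned tree could split on the attributes in a different order):

```python
# Root splits on Detection Activity; the System Test branch then
# splits on Severity. Consistent with all four rows of Table 2.5.
def classify(severity, detection_activity):
    if detection_activity == "Customer":
        return "Login"
    if severity == "B":              # System Test branch
        return "Validator"
    return "Load Balancer"

print(classify("B", "Customer"))     # Login (the entry with ID = 3)
print(classify("A", "System Test"))  # Load Balancer
```

The nested-conditions form also makes the interpretability argument concrete: the answer to "why Login?" is literally the first branch taken.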
Furthermore, decision trees have a fast classification phase, since an unseen entry only traverses down the tree, along the path decided by the decision nodes, until a terminal node is reached [15]. However, decision trees are sensitive to both noise and redundant features [16], which can lead to overly complex trees.

2.2.4 Support Vector Machines

A Support Vector Machine (SVM) is a classification method that constructs a linear boundary, called a hyperplane, that separates the entries by their labels into two partitions [15][19]. The class of an unseen entry is decided based on which partition the entry is located in.

As when using Logistic Regression, categorical inputs add another dimension for each unique value of the variable. Therefore, the problem of classifying whether a server is overloaded is used as an example. The training data and the unseen entry x can be seen in Table 2.6. Since the data is linearly separable, a SVM would identify an infinite number of possible hyperplanes during training. Three of the possible hyperplanes, denoted A, B and C, are shown in Figure 2.7.

ID   #Users (*1000)   Avg Sent (Mbit/s)   Overload
x    1.5              5                   -
1    2                1                   NO
2    1                2.5                 NO
3    3                4                   YES
4    4                3.5                 YES
5    4                5                   YES
6    1                1.5                 NO
7    2                3                   NO
8    3                3.5                 YES
9    1.5              2                   NO
10   3.5              4.5                 YES

Table 2.6: Example of server metric data labeled with whether the server is overloaded or not.

Figure 2.7: Example of possible hyperplanes for the linearly separable data seen in Table 2.6.

The problem of deciding between different possible hyperplanes is solved by searching for the Maximum Margin Hyperplane (MMH). The MMH is the hyperplane that creates the greatest separation between the two classes and is most likely to generalize best to future unseen data [15]. Furthermore, the hyperplane is represented by a set of support vectors, which are the points that lie on the hyperplane's maximum margin.
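A linear decision function of this kind can be sketched directly. The weights below are a hand-picked separating hyperplane for the data in Table 2.6, purely for illustration; a trained SVM would instead select the maximum margin hyperplane among the infinite candidates.

```python
# Illustrative linear decision function f(x) = w.x + b; the weights are
# hand-picked to separate the data in Table 2.6, not the actual MMH.
w = (1.0, 1.0)
b = -5.75

def classify(users_thousands, avg_sent_mbit):
    score = w[0] * users_thousands + w[1] * avg_sent_mbit + b
    return "YES" if score > 0 else "NO"

# The ten training entries from Table 2.6:
train = [((2, 1), "NO"), ((1, 2.5), "NO"), ((3, 4), "YES"), ((4, 3.5), "YES"),
         ((4, 5), "YES"), ((1, 1.5), "NO"), ((2, 3), "NO"), ((3, 3.5), "YES"),
         ((1.5, 2), "NO"), ((3.5, 4.5), "YES")]
all_correct = all(classify(*xy) == label for xy, label in train)

prediction = classify(1.5, 5.0)   # the unseen entry x -> "YES"
```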
Given the hyperplanes provided in Figure 2.7, the SVM would determine that B is the MMH and use it as the boundary between the partitions. The coordinates of the unseen entry x, seen in Table 2.6, would be (1.5, 5), which is above the partition boundary B. Therefore the entry would be labeled as Overload = YES.

SVMs can also be applied to data that is not linearly separable. This can be done by using either a slack variable or the kernel trick. The slack variable allows misclassification of entries at a cost C per entry; instead of searching for the MMH, the goal of the algorithm is then to minimize the total cost. The kernel trick is the process of using kernel functions to transform the data into a higher-dimensional space, so that the algorithm views the data as linearly separable. The kernel function adds new features based on mathematical relationships between the existing features.

Overall, SVMs have been used in a wide variety of domains where they have provided highly accurate models. Furthermore, the classification phase of SVMs is considered to be fast [16][15], because the label of an unseen entry is determined by its relation to the MMH. However, finding the MMH comes at a great computational cost, which slows down the training phase significantly.

2.2.5 Artificial Neural Networks

Artificial Neural Networks (ANNs) originate from attempts to represent the human brain as a mathematical model [20][21][15] and are used to solve a variety of problems in both supervised and unsupervised learning. The artificial neuron seen in Figure 2.8 behaves much like a biological neuron. The neuron receives a set of input signals, denoted x_i, which are multiplied by their respective weights w_i and summed to a single value. The total is then used by the activation function, denoted f, which produces the signal y(x). The formal definition of an artificial neuron can be seen in Formula 2.12.
Figure 2.8: An artificial neuron

y(x) = f( Σ_{i=1}^{n} w_i · x_i )    (2.12)

The main characteristics of an ANN are the activation function, the network topology and the training algorithm. The activation function calculates the output of the neuron, which is forwarded to the other connected neurons. The simplest activation function is the threshold activation function. A typical threshold activation function outputs a signal when the sum of the inputs reaches a certain threshold and does nothing otherwise. However, because of the discrete nature of the threshold activation function, the most common activation function is the logistic sigmoid activation function, which has a similar shape but is continuous.

The network topology of an ANN is defined by the number of layers, the allowed directions of data flow and the number of nodes in each layer. The ability of an ANN to find subtle patterns in complex data sets is generally determined by the size of the network and the composition of nodes. In ANNs, data can be allowed to flow backwards, which can enable the network to learn patterns in sequences of events over time. The weights of the inputs are determined during the training phase with the use of a training algorithm, the most common one being backpropagation.

2.2.6 K-Nearest Neighbors

Given a training data set, the Nearest Neighbor algorithm assigns an unseen entry x the same label as the training entry that is nearest given a distance metric d [15][11]. There are several distance metrics used for determining the nearest neighbor, such as the Manhattan distance and the Euclidean distance for continuous features, and the Hamming distance [22] for nominal features. To avoid misclassification in noisy domains, more than one of the nearest neighbors are used for deciding the label, which is why the algorithm is most commonly known as k-Nearest Neighbors (kNN).
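A minimal kNN sketch over nominal attributes, using the Hamming distance on their one-hot encodings; the data follows the defect-report example in Table 2.7.

```python
from collections import Counter

# Training data following Table 2.7: (Severity, Detection Activity) -> Asset.
train = [
    (("A", "Customer"), "Login"),
    (("B", "System Test"), "Validator"),
    (("B", "Customer"), "Login"),
    (("A", "System Test"), "Load Balancer"),
    (("C", "Function Test"), "Input Parser"),
]

def hamming(a, b):
    """Hamming distance over one-hot encodings: each differing nominal
    attribute flips two bits (one off, one on), so it contributes 2."""
    return sum(2 for u, v in zip(a, b) if u != v)

def knn_classify(entry, k=3):
    neighbors = sorted(train, key=lambda pair: hamming(entry, pair[0]))[:k]
    labels = [label for _, label in neighbors]
    return Counter(labels).most_common(1)[0][0]   # majority vote

# The unseen entry x: Severity = C, Detection Activity = Customer.
prediction = knn_classify(("C", "Customer"))      # -> "Login"
```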
When classifying the label of the entry x seen in Table 2.7, given the training examples below x, the algorithm would:

1. Calculate the distance (here, the Hamming distance) between x and the training examples.
2. Identify the k entries that are closest to x.
3. Identify the most common label c amongst the k entries.
4. Label x as c.

Assume k = 3. Given the data in Table 2.7, encoded as seen in Table 2.8, the algorithm would identify the entries 1, 3 and 5 as the k nearest neighbors. The most common label amongst the nearest neighbors is Login, therefore the entry x would be labeled as Login.

ID   Severity   Detection Activity   Asset
x    C          Customer             -
1    A          Customer             Login
2    B          System Test          Validator
3    B          Customer             Login
4    A          System Test          Load Balancer
5    C          Function Test        Input Parser

Table 2.7: Categorical defect report data with unseen entry x

ID   A   B   C   Customer   System Test   Function Test   Asset           Hamming Distance
x    0   0   1   1          0             0               -               -
1    1   0   0   1          0             0               Login           2
2    0   1   0   0          1             0               Validator       4
3    0   1   0   1          0             0               Login           2
4    1   0   0   0          1             0               Load Balancer   4
5    0   0   1   0          0             1               Input Parser    2

Table 2.8: The categorical defect report data seen in Table 2.7, encoded using the one-hot method.

The k-Nearest Neighbors algorithm has a fast training phase but is slow when classifying new entries [15][16]. During the training phase the algorithm does not learn from the training data; instead it stores the data in memory, which makes the training phase inherently fast. This is undesirable when dealing with large amounts of data, since the storage requirement scales with the number of training examples. The slow classification of an unseen entry is due to the great computational cost of calculating the distance between the entry and all training examples, and of finding the k examples with the shortest distance to the entry.

2.2.7 Performance Evaluation of Classification Models

There are several ways of evaluating the performance of a trained classifier.
The most common measures of performance are the error rate and the accuracy [18]. The accuracy of a trained classifier is derived by dividing the number of correctly labeled entities by the total number of labeled entities. The error rate is derived by dividing the number of incorrectly labeled entities by the total number of labeled entities. These metrics work well for balanced data sets; however, they are not accurate measures of performance for unbalanced data sets. Assume that we are predicting an output variable X that can take on two values, A and B. The distribution of the data is unbalanced: 98% of the values are labeled A and 2% are labeled B. A classifier that always predicts A would have an accuracy of 98%, yet most people would agree that this classifier is of no use.

If the class distribution is unbalanced, the accuracy metric needs to be complemented or substituted by other metrics, such as precision and recall. Precision is calculated by dividing the number of correct predictions of a class by the number of times the class was predicted [18]. The precision formula for a class can be seen in Formula 2.13, where true positives are denoted TP and false positives FP. Furthermore, the precision of a class A can be interpreted as the probability of the classifier being correct when labeling an unseen entry as A.

Precision = Pr = TP / (TP + FP)    (2.13)

Consider a data set with 98 entries labeled B and 2 entries labeled A, and a classifier that always labels entries as B, except for one correct prediction of A. The precision for class A would then be 1. Therefore the precision measure is complemented with the recall metric, which measures the probability of the classifier recognizing the class A when given an instance of the class A. Recall is calculated by dividing the number of correct predictions of the class by the number of instances of the class [18].
The formula of the recall metric can be seen in Formula 2.14, where false negatives are denoted FN. The measures recall and precision can be combined into a single metric called the F1-score, which is the harmonic mean of the two measures. The formula of the F1-score can be seen in Formula 2.15.

Recall = Re = TP / (TP + FN)    (2.14)

F1 = (2 · Pr · Re) / (Pr + Re)    (2.15)

To aggregate a metric that has been measured for all classes into a single value, two different methods of averaging can be used, namely micro-averaging and macro-averaging. The micro-average of a metric is calculated by dividing the sum of the numerators by the sum of the denominators of the measurements for each class. An example of micro-averaging the precision metric for the classes A and B can be seen in Formula 2.16, given the measures seen in Formula 2.17. This averaging method is biased towards the most populated class.

Pr = (TP_A + TP_B) / (TP_A + FP_A + TP_B + FP_B)    (2.16)

Pr_A = TP_A / (TP_A + FP_A)        Pr_B = TP_B / (TP_B + FP_B)    (2.17)

The macro-average is calculated by dividing the sum of the class metrics by the number of classes. Given the measures seen in Formula 2.17, the macro-average of the precision metric for the classes A and B would be expressed as seen in Formula 2.18. This averaging method is useful for unbalanced data sets since it weighs each class equally.

Pr = (Pr_A + Pr_B) / 2    (2.18)

Another metric that has been used to evaluate the performance of classification models on unbalanced data sets is the Matthews Correlation Coefficient (MCC) [23][24]. The metric's value ranges between -1 and 1 and indicates the degree of correlation between the predictions and the actual results, where 1 indicates complete correlation, 0 indicates no correlation and -1 indicates negative correlation. The formula for the Matthews Correlation Coefficient can be seen in Formula 2.19.
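The precision, recall, F1 and averaging computations above can be worked through in code. This sketch reuses the unbalanced example from the text: 98 entries of class B, 2 of class A, and a classifier that predicts B for everything except one correct prediction of A.

```python
# Counts for the worked example: one A found, none falsely claimed, one missed;
# all B found, one A mislabeled as B.
TP_A, FP_A, FN_A = 1, 0, 1
TP_B, FP_B, FN_B = 98, 1, 0

def precision(tp, fp):   # Formula 2.13
    return tp / (tp + fp)

def recall(tp, fn):      # Formula 2.14
    return tp / (tp + fn)

def f1(pr, re):          # Formula 2.15
    return 2 * pr * re / (pr + re)

pr_A, re_A = precision(TP_A, FP_A), recall(TP_A, FN_A)   # 1.0, 0.5
pr_B, re_B = precision(TP_B, FP_B), recall(TP_B, FN_B)   # 98/99, 1.0
f1_A = f1(pr_A, re_A)                                    # 2/3

# Micro-average (Formula 2.16) pools the counts and is dominated by class B;
# the macro-average (Formula 2.18) weighs both classes equally.
micro_pr = (TP_A + TP_B) / (TP_A + FP_A + TP_B + FP_B)   # 0.99
macro_pr = (pr_A + pr_B) / 2
```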
MCC = (TP · TN − FP · FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))    (2.19)

Furthermore, the Matthews Correlation Coefficient has been generalized to cover cases where the target variable is non-binary. Given a K × K confusion matrix C, as seen in Formula 2.20, where the rows indicate the predicted class and the columns indicate the actual class, the metric can be formulated as seen in Formula 2.21.

C = [ C_11  C_12  ...  ]
    [ ...   ...   ...  ]    (2.20)
    [ C_K1  ...   C_KK ]

MCC = Σ_k Σ_l Σ_m (C_kk · C_lm − C_kl · C_mk)
      / ( √( Σ_k (Σ_l C_kl)(Σ_{k'≠k} Σ_{l'} C_{k'l'}) ) · √( Σ_k (Σ_l C_lk)(Σ_{k'≠k} Σ_{l'} C_{l'k'}) ) )    (2.21)

2.2.8 Pitfalls in Supervised Learning

When developing a classification model there are a number of pitfalls that need to be taken into consideration. This section describes these pitfalls and how to avoid them.

2.2.8.1 Performance Validation of Classification Models

As the goal of a classification model is to adequately classify a target variable for unseen entries, the classifier should be properly evaluated. Other than deciding appropriate performance measures for the problem domain, it is important to consider how the measures translate beyond the training data. Evaluating the model on the training data is misleading, since performing well on the training data is easy [25]. For instance, a classifier that memorizes the training data will have an accuracy of 100% when evaluated on the training data. Therefore, it is important to evaluate the model on data that has not been used for training it. This is done by splitting the data set into a training set and a test set, so that the model is evaluated on unseen data.

To avoid the influence of the selection of the test set, k-Fold Cross-Validation is used. In k-Fold Cross-Validation the data is split into k subsets of the same size, which are used by the learning algorithm k times. Let us denote the set of subsets {x1, x2, ..., xk} as X.
For each run, from i = 1 to i = k, the algorithm uses all subsets {x_j ∈ X | j ≠ i} for training and x_i for testing. The final score, used for evaluating the model, is the average of all test scores. Furthermore, even if the data set shares the same class distribution as the data encountered in the problem domain, randomly splitting the data can yield a pessimistic performance estimate [26]. In such cases, each fold can be stratified to maintain the same class distribution as the entire data set. For instance, when splitting a data set that contains 90 instances of class A and 10 instances of class B into 10 stratified folds, each fold would contain 9 instances of class A and 1 instance of class B. Using stratified folds with k-Fold Cross-Validation is known as Stratified k-Fold Cross-Validation.

2.2.8.2 Missing Attribute Values

In many data sets the values of some attributes are missing, which can cause problems for certain classification algorithms. For instance, given an entry with one or more missing values, an artificial neuron will fail to calculate the input of the activation function, Σ_i w_i x_i. The entries with missing attribute values can be removed, but for some data sets this would mean removing a majority of the data [18]. The other option is to fill in the missing values, which can be done using different strategies [11].

One strategy is to decide the value based on the observed values of the attribute. This can be done by choosing the most frequent value, the average of all values, or a random value drawn from the attribute's distribution. However, this can be misleading, especially in cases where the attributes are dependent. For instance, consider a defect report with the values Status = Open, where Status ∈ {Open, Closed}, and Action = ?, where Action ∈ {Pending, Fixed}. Setting the Action attribute to Fixed would confuse the model, as this data point would be considered noise.
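The most-frequent-value strategy can be sketched as follows; the attribute values are illustrative, not taken from the thesis data set.

```python
from collections import Counter

# Fill missing nominal values with the most frequent observed value
# (one of the strategies described above). None marks a missing value.
detection_activity = ["Customer", "System Test", None, "Customer", None]

observed = [v for v in detection_activity if v is not None]
most_frequent = Counter(observed).most_common(1)[0][0]   # "Customer"
filled = [v if v is not None else most_frequent for v in detection_activity]
```

For a continuous attribute, the same pattern applies with the mean of the observed values in place of the mode.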
Another strategy is to train a model to determine the values of the examples with missing attribute values. This is done by training a model on all examples that have values for the attribute of interest. When the model has been trained, the values of the examples with missing values are determined by the model. This adds another layer of complexity, since each attribute with missing values adds a new machine learning problem.

2.2.8.3 Class Imbalance Problem

A data set where the distribution of classes is unbalanced is said to suffer from the Class Imbalance Problem [11]. For instance, in a training set for classifying malicious network requests, the class distribution will most likely be unbalanced. As the imbalance of the class distribution grows, the accuracy when classifying the minority class decreases, because of the model's bias towards the majority class [18]. Given a classification task where the goal is to maximize the accuracy, and where the provided training and test data share the same class distribution, the class imbalance problem is not deemed meaningful [11]. However, when classifying malicious network requests, the minority class is the class of interest. Therefore the model needs to be evaluated with other metrics that highlight the classification performance on the minority classes, such as MCC or the F1-score.

Substituting the accuracy metric with MCC or the F1-score only helps to properly evaluate the model and detect the model's biases towards different classes. To mitigate the class imbalance problem itself, the training data needs to be re-sampled. There are two common methods for re-sampling the training data, namely undersampling and oversampling [18].

When undersampling the training data, examples from the majority class are removed from the data set until the class distribution is balanced. The examples to remove can be selected at random or by identifying noisy instances of the majority class.
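Random undersampling can be sketched in a few lines; the 90/10 label distribution below is illustrative.

```python
import random

# Random undersampling: drop majority-class examples until the class
# distribution is balanced. Labels are illustrative, not thesis data.
random.seed(0)  # deterministic for the example

labels = ["A"] * 90 + ["B"] * 10          # 90/10 imbalance, A is the majority
majority = [i for i, y in enumerate(labels) if y == "A"]
minority = [i for i, y in enumerate(labels) if y == "B"]

kept = random.sample(majority, len(minority))   # keep as many A as there are B
balanced_indices = sorted(kept + minority)      # 10 of each class remain
```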
However, in some cases the training set is small, and undersampling it is therefore impractical. In such cases oversampling is used, where more examples from the minority class are added to the training set to balance the class distribution. The new examples are created by copying existing examples from the minority class. The copies are then either kept as-is, or minor modifications are made to the continuous attributes of each copy.

2.2.8.4 Overfitting

Overfitting is when the model fits the training set well but fails to generalize to unseen data [27]. This is due to the model trying to describe all the data points rather than the underlying distribution of the data [11], as seen in Figure 2.9. A common approach for avoiding overfitting is regularization.

Figure 2.9: Example of a complex overfitted model.

In flexible algorithms such as Logistic Regression and SVMs, the parameters that maximize the training score are selected during training. This favours overly complex models, as seen in Figure 2.9. To penalize complex models, regularization is used during training. Regularization is the method of using a regularizer, which quantifies the complexity of a model, together with the performance measures when selecting a model.

2.2.8.5 Feature Engineering

Feature engineering is the process of constructing, extracting and selecting features. Features are characteristics of an object that are representations of, or calculations made with, the object's attributes. Consider the data seen in Table 2.9, where the nominal attribute Detection Activity is used as a feature without any modifications. Training a SVM with this data would not be possible, since the inputs cannot be placed in a numerical space. Therefore the attributes of an object have to be represented in a way that can be interpreted by the learning algorithm.
ID   Detection Activity   Asset
1    Customer             Login
2    Function Test        Validator

Table 2.9: Example of nominal defect data

A common way of constructing features from nominal attributes is one-hot encoding. One-hot encoding creates a new dichotomous feature for each possible value of the attribute, where the value 1 indicates the presence of that value. One-hot encoding the Detection Activity attribute yields the features seen in Table 2.10, which can now be used for training a SVM.

ID   Function Test   Customer   Asset
1    0               1          Login
2    1               0          Validator

Table 2.10: Example of features constructed from the nominal defect data in Table 2.9 using one-hot encoding

Now consider the data seen in Table 2.11, where the attribute Description contains human-written text. Treating the attribute as a nominal attribute and constructing features from it using one-hot encoding would yield the data seen in Table 2.12. When calculating the Hamming distance between the unseen entry x and the other entries, the resulting distances would all be the same. This is because the similarities between the texts are not quantified; either the texts are identical or they are not. Therefore text is usually represented as a bag of words [11], which is a feature vector that describes the occurrence of each word in the text.

ID   Description                   Asset
x    Login fails                   -
1    Login not working             Login
2    Validate function incorrect   Validator

Table 2.11: Example of human-written defect descriptions.

ID   Login not working   Validate function incorrect   Asset
x    0                   0                             -
1    1                   0                             Login
2    0                   1                             Validator

Table 2.12: Example of features constructed from human-written text using one-hot encoding

A simple bag of words representation is constructed by first creating a vocabulary. The vocabulary is constructed by collecting the words of all entries used for training into a set. The set will then contain all the distinct words found in all documents.
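The vocabulary construction and word counting can be sketched as follows, using the training descriptions of Table 2.11 (words are lowercased here for simplicity, an assumption not made explicit in the text):

```python
# Simple bag of words: build a vocabulary from the training descriptions
# of Table 2.11, then count word occurrences per text.
train_corpus = ["Login not working", "Validate function incorrect"]

# The vocabulary is the set of distinct words across the training documents.
vocabulary = sorted({w.lower() for text in train_corpus for w in text.split()})

def bag_of_words(text):
    """Count occurrences of each vocabulary word; out-of-vocabulary words
    (such as 'fails' in the unseen entry) are simply ignored."""
    words = text.lower().split()
    return [words.count(w) for w in vocabulary]

vec_x = bag_of_words("Login fails")        # the unseen entry x
vec_1 = bag_of_words("Login not working")  # entry 1, sharing the word "login"
```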
After the vocabulary has been constructed, the number of occurrences of each word in the vocabulary is counted for each entry. The resulting vectors are then used as representations of the documents. Applying this technique to the data seen in Table 2.11 yields the features seen in Table 2.13. Now, when calculating the Hamming distance, the closest entry to the unseen entry x is entry 1.

ID   Login   not   working   Validate   function   incorrect   Asset
x    1       0     0         0          0          0           -
1    1       1     1         0          0          0           Login
2    0       0     0         1          1          1           Validator

Table 2.13: Example of features constructed from human-written text using a simple bag of words model

To highlight the words that distinguish each textual entry, also known as a document, a method called Term Frequency - Inverse Document Frequency (TF-IDF) can be used. Similar to the simple bag of words model, features are constructed by first creating a vocabulary. After the vocabulary has been constructed, the frequency of each word in the vocabulary is calculated for each document. The term frequency is calculated using the function seen in Formula 2.22 for each word and document, where w denotes a word and d a document.

tf(w, d) = (# occurrences of w in d) / (# words in d)    (2.22)

To weight each word based on its occurrence across all documents, the inverse document frequency is calculated for each word, using the logarithmic function seen in Formula 2.23. Through the logarithm, words that occur frequently across all documents receive a weight close to zero, while words that occur rarely receive a higher weight.

idf(w) = log( (# documents) / (# documents containing w) )    (2.23)

Each position of the final feature vector, which corresponds to a word in the vocabulary, is derived by multiplying the term frequency with the inverse document frequency for each document. The complete function for TF-IDF can be seen in Formula 2.24.
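The two factors and their product can be sketched over the training descriptions of Table 2.11; the lowercasing is an assumption for simplicity.

```python
import math

# TF-IDF sketch: tf weighs a word within one document; idf (with the
# logarithm) down-weights words that occur across many documents.
documents = ["Login not working", "Validate function incorrect"]

def tf(word, doc):
    words = doc.lower().split()
    return words.count(word) / len(words)

def idf(word):
    containing = sum(1 for d in documents if word in d.lower().split())
    return math.log(len(documents) / containing) if containing else 0.0

def tf_idf(word, doc):
    return tf(word, doc) * idf(word)

# "login" occurs in 1 of the 2 documents: tf = 1/3, idf = log(2).
weight = tf_idf("login", "Login not working")
```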
tf-idf(w, d) = tf(w, d) · idf(w)    (2.24)

3 Related Work

No previous research has been conducted on predicting the asset attribute of a defect report. However, using machine learning to predict other defect attributes and to improve the process of defect resolution is not uncommon. The systematic mapping study conducted by Cavalcanti et al. [28] reported that most research on the topic of defect classification is centered on defect assignment and duplicate detection. Furthermore, Cavalcanti et al. noted that previous research could be extended to classify other defect attributes that are recorded manually, such as asset and severity. This could be helpful for less experienced reporters when reporting a defect.

3.1 Defect Assignment

Automating the task of defect assignment is a well-researched topic. In a systematic review [29], 75 papers researching the topic were reviewed, with machine learning being one of the most common approaches to automating defect assignment. Previous studies have evaluated several classification techniques, with the most common evaluation metric being accuracy [29]. The reported accuracies of individual classifiers range from 25% [30] to 64% [31], with higher accuracies being reported with the use of meta-algorithms [32][7]. The most common features are constructed from the descriptions of defect reports using TF-IDF [29]. However, categorical attributes such as Asset [31][30][33][32][34], Artifact [30][31], Submitter [35][34] and Detection Activity [34] are also used.

When evaluating their classification model, Banitaan et al. [33] reported that the features constructed from the asset attribute were the most influential features for three out of the four data sets used. Similar findings were reported by Anvik et al. [31], who improved the accuracy of their classification model for Eclipse by 48% when using features constructed from the asset attribute.
However, both papers validated the performance of their classification models using resolved defect reports. Therefore, the effect of the asset attribute being assigned an incorrect value was never observed.

3.2 Duplicate Defect Report Detection

A duplicate defect report describes a defect that has already been reported and submitted into the DTS. Duplicate defect reports are unwanted in the DTS, since investigating them inherently means that two different assigned entities are doing the same work. Therefore, organizations often have designated staff investigating the incoming defect reports for duplicates [28].

To automate the process of detecting duplicate defect reports, several methods have been proposed that utilize the description of the defect report to measure the similarity between defect reports [36][37][38][39]. By using TF-IDF to construct features from the defect description, Jalbert et al. [37] developed a model that managed to filter out 8% of the duplicate defect reports before they reached the DTS. With a detection rate of 8%, the model could not replace the manual labour required for detecting duplicates. However, the model could reduce the number of defect reports to be inspected manually. Since the model's detection rate of non-duplicates was 100%, the only cost of deploying the model would be the cost of classifying each defect report. The time required for classifying a defect report was reported to be 20 seconds.

Tian et al. [39] developed a classification model with a detection rate of 24% by also including features constructed from the categorical attributes of the defect reports. However, the improvement of the detection rate led to a loss of 9% in the non-duplicate detection rate.

3.3 Text Classification

Text classification is the task of organizing unlabeled documents into categories. In a paper about text classification with SVMs, Joachims [40] identified some of the challenges of text classification.
Even when the stop words of a text corpus are removed, the number of remaining words is still considerable. This high-dimensional input space can lead to problems with generalizability for some algorithms. A technique for reducing the dimensionality of the input space is to remove features that are irrelevant. However, Joachims [40] noted that few of the features constructed from the words of a corpus are irrelevant, and that removing features therefore results in information loss.

The algorithms discussed in the Background chapter, namely Naive Bayes, Logistic Regression, Decision Trees, Support Vector Machines, Artificial Neural Networks and k-Nearest Neighbors, can all be used for text classification [41]. However, according to Joachims [40], most text classification problems are linearly separable, which benefits algorithms that use linear functions to separate the categories. Therefore, linear SVMs, Naive Bayes and Logistic Regression are common algorithms for text classification. However, as stated by the No Free Lunch Theorem [11], there is no single algorithm that works best for every problem.

3.4 Defect Prediction In Industry

Prediction models in the domain of software defects are widely researched, although most of the previously presented studies concern defects in open source software. However, research has also been conducted in the domain of large-scale industrial software projects. Among the problems studied in this domain are Defect Inflow Prediction [42][43] and Software Defect Prediction [44].

Defect Inflow Prediction is the task of predicting the number of non-redundant defects being reported into the DTS [43]. In a study conducted by Staron et al. [43], a model was proposed for predicting the defect inflow during the planning phase up to 3 weeks in advance. The model was constructed using multivariate linear regression and modeled the defect inflow as a function of characteristics of work packages.
The results showed that the model could support project managers in estimating the work effort needed for completing the project, by providing a defect inflow prediction accuracy of 72%.

To make the testing phase more efficient, a Software Defect Prediction (SDP) model can be used. SDP is the task of predicting which software assets are prone to defects. By using an SDP model, organizations can make testing more efficient by allocating more resources to the predicted assets [44]. Predicting assets prone to defects can be done using machine learning. Rana et al. [44] highlight that the problem can be framed either as a classification or as a regression problem. A classification model for SDP classifies modules, represented by software metrics and code attributes, as fault-prone or non-fault-prone based on previous projects [45]. For this task, several algorithms, including Naive Bayes, Logistic Regression, Decision Trees, Support Vector Machines, Artificial Neural Networks and k-Nearest Neighbors, have been used and have shown significant results. However, the comparative study conducted by Lessmann et al. [45] concluded that the choice of classification algorithm had little importance when comparing the performance of the 17 most accurate classifiers studied.

Even though machine learning has been used to develop models for SDP, Rana et al. [44] identified that the adoption of machine learning for SDP in industry has been limited. The study showed that attributes other than predictive accuracy, such as cost, reliability and generalizability, affect the willingness to adopt the technology. To accelerate the adoption of machine learning in industry, Rana et al. [44] developed a framework for comparing machine learning based techniques to existing systems. Using the framework helps industry make informed decisions and reflect on strengths and areas of improvement with respect to a given technology.
4 Research Design

4.1 Research Questions

The aim of this research is to develop a model for classifying the asset from which a defect originates, using historic defect reports. Furthermore, the model is evaluated using metrics that are appropriate for the given data set and by measuring the potential time-savings at a large telecom company. This leads to the main research question:

RQ 1: How can the source asset of a given defect report be identified using machine learning?

In order to develop a model for classifying the asset from which a defect originates, qualitative features need to be constructed. This leads to the first sub-question:

RQ 1.1: Which attributes of the defect reports can be used for constructing features?

When the available attributes have been determined, the initial set of features is constructed. However, some features might not contain information that describes the functional relationship between the defect report and the asset, which might decrease the performance of the classification model. Therefore, a suitable feature set needs to be determined. This leads to the second sub-question:

RQ 1.2: Which set of features provides the best results using k-fold cross-validation?

This question is addressed by evaluating a number of supervised classification algorithms with different subsets of features. The outcome will show which combination provides the most accurate classifications. The answers to the sub-questions will be used to answer the main research question. In further detail, the answers will provide a classification model for classifying the asset from which a defect originates, which will be evaluated by measuring accuracy, recall, precision, the F1-score and the Matthews Correlation Coefficient.

4.2 Research Methodology

The research conducted followed the design science research methodology seen in Figure 4.1.
The research started by analyzing the defect reports of the organization, which confirmed that, similar to Microsoft [6], the asset attribute was oftentimes assigned incorrect values. This identified the need to aid the defect reporter by recommending a value for the asset attribute.

Figure 4.1: Design science research methodology as explained by Hevner et al. [46]

To gain applicable knowledge for developing a classification model capable of classifying the asset from which a defect originates, a literature review was conducted. The outcome of the literature review was information about classification algorithms to consider, the process of developing and evaluating a model, previously applied techniques for predicting defect attributes and the life cycle of a defect in the organization, which is presented in Sections 2 and 3. Given the organization's data set, the possibilities of using machine learning for predicting the asset attribute were evaluated in order to justify the development of a classification model. To get familiar with the data set and its attributes, data exploration and mining techniques were used. The data exploration showed that the historic defect reports were labeled with the asset from which the defect was removed and could therefore be used to train a classification model. After the previous steps, the development and evaluation of different algorithms, together with different subsets of features, started.

4.2.1 Data set

The data provided by the organization contained defect reports, submitted by both customers and employees, starting from 2010, grouped by two different products. The two products, denoted as Product 1 and Product 2 in this thesis, are mature products built from several million lines of code with a few hundred active developers each. Furthermore, the products are deployed internationally and each have more than 10,000 submitted historic defect reports.
The data set containing the historic defect reports was imported into a table to act as a snapshot. Only attributes that were mandatory to record when creating a defect report were included in the data set to avoid missing data. The mandatory attributes were: Description, Severity, Detection Activity, Artifact and Asset. These attributes were a subset of the attributes presented in Table 2.1. Furthermore, the descriptions of the mandatory attributes with their respective variable types can be seen in Table 4.1.

Attribute | Type | Description
Description | Textual | Description of what is missing, wrong, or unnecessary.
Severity | Ordinal | The highest failure impact that the defect could (or did) cause, as determined by (from the perspective of) the organization responsible for software engineering.
Detection Activity | Nominal | The activity during which the defect was detected (i.e., inspection or testing).
Artifact | Nominal | The specific software work product containing the defect.
Asset | Nominal | The software asset (product, component, module, etc.) containing the defect.

Table 4.1: The mandatory attributes of the defect reports that were provided by the telecom company.

Defect reports which were not addressed or resolved were removed since they could not have been used during training or validation. This was because the asset attribute of the defect reports could be reassigned during the resolution process; the assigned value could therefore be incorrect since it was not final. Furthermore, the defect reports that described failures which were not caused by a defect were removed. Such failures could, for instance, be caused by not following the documentation when configuring the system. Since the reported failure was not caused by a defect, it would not result in a corrective measure in an asset, and the assigned value of the asset attribute could therefore be considered incorrect. A lower bound for the number of possible assets for each product can be seen in Table 4.2.
Product | #Assets
Product 1 | >40
Product 2 | >100

Table 4.2: The number of possible assets for each product.

4.2.2 Development Setup

The development and evaluation were performed on a laptop running Windows 7. The limited performance affected the time required to run the experiments. If, for instance, the experiments had been conducted on a mainframe computer, each iteration of the 10-fold cross-validation could have been run in parallel. This would have been useful when evaluating algorithms with a slow training phase, such as SVMs, as the computer is unusable during training. Furthermore, this limited the possibilities of tuning each algorithm, since each tuning task requires an exhaustive search over the parameter values. The optimal value is decided by comparing the score of each parameter value using cross-validation on the training set. The framework that was used for feature engineering and machine learning tasks is Scikit-learn [47]. Scikit-learn offers a vast library of machine learning algorithms and tools for both feature engineering and model evaluation. The selection of the framework was based on ease of use, community support and familiarity with Python.

4.2.3 Algorithms

The classification algorithms evaluated for constructing a classification model can be seen in Table 4.3. The set of algorithms was decided based on their usage in related research. As stated by the No Free Lunch Theorem [11], there is no single algorithm that works best for every problem, and therefore the most suitable algorithm was determined through the conducted tests. Furthermore, a neural network implementation was excluded due to the long training phase and the large number of parameters that would need to be tuned.

Classification Algorithm | Description
MultinomialNB [48] | Naive Bayes implementation for multinomially distributed data.
DecisionTreeClassifier [49] | A decision tree implementation for classification that uses an optimized version of CART [50].
LogisticRegression [51] | A Logistic Regression implementation for classification.
KNeighborsClassifier [52] | A K-Nearest Neighbors implementation that uses five neighbors as default.
LinearSVC [53] | A SVM for classification that uses a linear kernel with the cost parameter set to 1 as default.

Table 4.3: Descriptions of the evaluated classification algorithms.

4.2.4 Feature engineering

From the mandatory defect attributes seen in Table 4.1, a set of features was constructed. To represent the textual attribute Description as features, the attribute was transformed into a feature vector using TF-IDF. The TF-IDF implementation of Scikit-learn [54] also offers removal of stop words and tokenization of words using regular expressions. This was utilized to remove all the stop words defined by Scikit-learn [55] and to tokenize the words so that each token contained at least one alphabetical character, using the following pattern: (?ui)\b\w*[a-z]+\w*\b. The stop words and numerical sequences were removed since they were irrelevant for describing the attribute. The 1,000 words with the highest TF-IDF scores were selected, since a higher number of words would have increased the training time of the model. However, the number of words selected from a TF-IDF vector is a parameter that needs to be tuned, which is not considered in this thesis. The categorical attributes of the defect reports were transformed into features using Scikit-learn's algorithm for One-Hot encoding [56].

4.2.5 Iteration 1

The first iteration evaluated the classification scheme seen in Figure 4.2. Using this scheme, the reporter would not assign the asset attribute a value. Instead, the trained classification model would decide the asset based on the attributes recorded by the reporter. To decide which attributes the model should use when classifying the asset, three tests were conducted. The three tests were:

1.
Learning with features constructed from the categorical attributes
2. Learning with features constructed from the textual attributes
3. Learning with a combination of the features used for the previous tests

Furthermore, to evaluate which algorithm provided the best results, each test was conducted with the same set of classification algorithms, described in Table 4.3.

Figure 4.2: Classification scheme for Iteration 1

The metrics used for evaluation during the tests were Accuracy, Precision, Recall, F1-Score and the Matthews Correlation Coefficient. These metrics were used to measure how the classifier performs across all classes, which the accuracy measure fails to do in unbalanced data sets. However, while these measures show the significance of the classifier, the organization has determined that accuracy is the most important metric. To validate the measures of each classification model, 10-fold cross-validation with stratified folds, described in Chapter 2, was used. This method of validation has been shown to provide a better estimation of performance than lower values of K [26], while higher values of K come at a higher computational cost and bias. Furthermore, stratified folds were used to reduce the variance between the different cross-validation iterations and to maintain the class distribution upon validation. Three benchmarks were created using three different setups, described in Table 4.4, of Scikit-learn's DummyClassifier. These setups represent three different classifiers that do not learn anything about the underlying relationships between the features and the target variable, but instead classify the asset of a given defect report using simple strategies such as the class distribution of the training data. The benchmark values, seen in Table 4.5 for Product 1 and Table 4.6 for Product 2, were used as lower bounds for evaluating the possibilities of using machine learning to classify the asset from which a defect originates.
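The feature construction (Section 4.2.4) and evaluation setup described above can be sketched with Scikit-learn. The defect reports and asset labels below are hypothetical toy data; the real data set is proprietary, and the thesis uses 10 folds rather than the 3 shown here:

```python
# Illustrative sketch: TF-IDF features for the textual attribute, one-hot
# features for a categorical attribute, a LinearSVC classifier, stratified
# cross-validation, and a DummyClassifier baseline.
from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import LinearSVC
import pandas as pd

reports = pd.DataFrame({
    "Description": ["login fails after timeout", "decoder drops frames",
                    "load balancer rejects requests", "login page crashes",
                    "decoder output corrupted", "balancer misroutes traffic"] * 5,
    "Severity": ["High", "Medium", "High", "Low", "Medium", "High"] * 5,
})
assets = ["Login", "Decoder", "Load Balancer"] * 10  # target: source asset

features = ColumnTransformer([
    # TF-IDF over the textual attribute: stop words removed, tokens must
    # contain at least one alphabetical character, top-1000 terms kept.
    ("text", TfidfVectorizer(stop_words="english",
                             token_pattern=r"(?ui)\b\w*[a-z]+\w*\b",
                             max_features=1000), "Description"),
    # One-hot encoding of the categorical attribute.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Severity"]),
])

model = Pipeline([("features", features), ("clf", LinearSVC())])
baseline = DummyClassifier(strategy="most_frequent")

cv = StratifiedKFold(n_splits=3)  # the thesis uses 10 stratified folds
model_acc = cross_val_score(model, reports, assets, cv=cv).mean()
base_acc = cross_val_score(baseline, reports, assets, cv=cv).mean()
```

A trained model whose cross-validated scores do not exceed the baseline's would, per the criterion above, not be considered useful.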
If the performance of a classification model did not exceed these benchmarks, then it had not gained any knowledge about the functional relationship between the features and the asset and was therefore not considered to be useful by the telecom company.

Setup | Description
stratified | Classifies entries based on the distribution of the label class in the training set.
most_frequent | Classifies entries based on the most frequent label in the training set.
uniform | Classifies entries uniformly at random.

Table 4.4: Descriptions of the different setups of the DummyClassifier used in this thesis.

The benchmark values, presented in Table 4.5 and Table 4.6, show that there is no correlation between the models' classifications and the actual assets, since the MCC is equal to or less than zero. The highest accuracy is achieved by always classifying the asset of a given defect report as the most frequently occurring asset in the training data. The models using this strategy correctly classify the asset of 13.1% of the defect reports for Product 1 and 9.4% of the defect reports for Product 2. However, this results in a macro-averaged precision and recall close to zero, since the models fail to classify any asset other than the most frequent one. Using any other strategy results in a lower accuracy which in most cases is close to zero.

Setup | Accuracy | Precision (Macro) | Recall (Macro) | F1-score (Macro) | MCC
stratified | 8.01% | 2.2% | 2.44% | 2.13% | -0.001
most_frequent | 13.1% | 0.3% | 2.33% | 0.54% | 0.0
uniform | 2.37% | 2.26% | 2.54% | 1.65% | -0.001

Table 4.5: The benchmark values by setup for Product 1.

Setup | Accuracy | Precision (Macro) | Recall (Macro) | F1-score (Macro) | MCC
stratified | 2.85% | 0.8% | 0.81% | 0.78% | -0.002
most_frequent | 9.4% | 0.08% | 0.85% | 0.15% | 0.0
uniform | 0.79% | 0.83% | 0.93% | 0.63% | -0.001

Table 4.6: The benchmark values by setup for Product 2.

4.2.6 Iteration 2

The second iteration evaluated the classification scheme seen in Figure 4.3.
The first iteration did not make use of the reporter's expertise. Instead of deterministically classifying the source asset of a given defect report, the model in the second iteration would provide the reporter with a list of likely assets for the recorded attributes of the defect report. This would leave the final selection of the source asset to the reporter, who would use their expertise to select an asset from the list of recommended assets. Furthermore, the recommendation system could also aid reporters who have little to no expertise, since the list of possible assets would be shortened.

Figure 4.3: Classification scheme for Iteration 2

Similar to the first iteration, three tests were conducted to decide which attributes the model should use when classifying the recommendations. The three tests were:

1. Learning with features constructed from the categorical attributes
2. Learning with features constructed from the textual attributes
3. Learning with a combination of the features used for the previous tests

Furthermore, to evaluate which algorithm provided the best results, each test was conducted with the same set of classification algorithms, described in Table 4.3. As the performance metrics for recommendation systems differ from the metrics used for classification models, different metrics than those used for Iteration 1 were used. The metrics used for evaluation during the tests were Recall@n and Precision@n. Recall@n is the previously presented recall metric extended to recommendation systems and is calculated using Formula 4.1, where n denotes the number of recommended items.

recall@n = |relevant recommended items @n| / |relevant items|   (4.1)

Calculating the Recall@3 for the recommendations seen in Table 4.7 results in a value of approximately 0.667 = 66.7%, since the number of relevant recommended assets is 2 and the number of relevant assets is 3 across all recommendations.
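The Recall@n and Precision@n computations can be sketched as follows, reproducing the worked example of Table 4.7 (the asset names are taken from that example):

```python
# Illustrative sketch of the recall@n and precision@n metrics.
def recall_at_n(recommendations, correct_assets, n):
    """Fraction of relevant assets that appear among the top-n
    recommendations (each defect report has exactly one relevant asset)."""
    hits = sum(correct in recs[:n]
               for recs, correct in zip(recommendations, correct_assets))
    return hits / len(correct_assets)

def precision_at_n(recommendations, correct_assets, n):
    # With one relevant asset per report, precision@n = recall@n / n.
    return recall_at_n(recommendations, correct_assets, n) / n

recs = [["Login", "Validator", "Load Balancer"],
        ["Decoder", "Login", "Load Balancer"],
        ["Router", "Validator", "Decoder"]]
correct = ["Login", "Validator", "Router"]

print(recall_at_n(recs, correct, 3))     # 2/3: two of three correct assets hit
print(precision_at_n(recs, correct, 3))  # 2/9
```

Reports 1 and 3 have their correct asset among the three recommendations while report 2 does not, giving the 2-out-of-3 recall discussed above.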
Id | Recommended Assets | Correct Asset | |Relevant Recommended Assets| | |Relevant Assets|
1 | [Login, Validator, Load Balancer] | Login | 1 | 1
2 | [Decoder, Login, Load Balancer] | Validator | 0 | 1
3 | [Router, Validator, Decoder] | Router | 1 | 1

Table 4.7: Example of recommendations provided by a system that recommends three assets for a given defect report.

The Precision@n metric is the precision metric extended to recommendation systems and is calculated using Formula 4.2.

precision@n = |relevant recommended items @n| / |recommended items|   (4.2)

Since each defect report has only one correct asset, the accuracy metric equals the Recall@n metric. Furthermore, the number of recommended items, used by the Precision@n metric, can be simplified to n × |relevant items|. Therefore the Precision@n measure can be simplified to Formula 4.3.

precision@n = recall@n / n   (4.3)

Using the Recall@3 calculated for the recommendations seen in Table 4.7 results in the Precision@3 seen in Formula 4.4.

precision@3 = recall@3 / 3 ≈ 0.222 = 22.2%   (4.4)

Similar to the first iteration, 10-fold cross-validation with stratified folds was used to validate the measures of each classification model. Furthermore, a benchmark was developed to emulate a recommendation system that uses the class distribution of the training data to provide recommendations. The benchmark values for each product can be seen in Table 4.8.

Product | Recall@1 | Recall@3 | Recall@5 | Recall@10
Product 1 | 13.1% | 36.75% | 52.15% | 78.71%
Product 2 | 9.4% | 20.16% | 26.66% | 40.16%

Table 4.8: The benchmark values of the recommendation system for each product.

5 Results

In this chapter the results of each iteration are presented. Section 5.1 presents the results of Iteration 1, which describes the performance of classification models trained with three different feature sets. Section 5.2 presents the results of Iteration 2, which describes the performance of recommendation models trained with three different feature sets.
Furthermore, in Section 5.2 the feature set with the best performing models was used to create charts which show the balance between precision and recall for each classification model.

5.1 Iteration 1

This section presents the results of the first iteration. The first iteration aimed to evaluate which set of features and classification algorithm provided the best performance when constructing a classification model for classifying the asset from which a defect originates. For each feature set the classification model with the maximum accuracy is bolded. To complement the accuracy metric, macro-averages of Precision, Recall and F1-Score are provided. Furthermore, to show the correlation between the predictions and the actual labels, the MCC metric is provided. Tables 5.1 and 5.2 show the performance of the classification models trained with features constructed from the categorical attributes of defect reports from Product 1 and Product 2 respectively. For Product 1 the maximum accuracy of 26.82% is achieved by using LogisticRegression. For Product 2 the maximum accuracy of 29.56% is achieved by using LinearSVC.

Classifier | Accuracy | Precision (Macro) | Recall (Macro) | F1-score (Macro) | MCC
MultinomialNB | 26.28% | 7.47% | 8.17% | 7.29% | 0.186
DecisionTreeClassifier | 25.93% | 8.62% | 8.36% | 7.88% | 0.183
LogisticRegression | 26.82% | 7.87% | 8.23% | 7.31% | 0.191
KNeighborsClassifier | 21.81% | 8.45% | 7.52% | 7.15% | 0.147
LinearSVC | 26.43% | 7.48% | 8.27% | 7.24% | 0.185

Table 5.1: Results for models trained with features constructed from the categorical attributes of defect reports from Product 1
Classifier | Accuracy | Precision (Macro) | Recall (Macro) | F1-score (Macro) | MCC
MultinomialNB | 28.72% | 11.33% | 13.77% | 10.61% | 0.267
DecisionTreeClassifier | 29.3% | 17.16% | 18.4% | 15.5% | 0.273
LogisticRegression | 29.47% | 14.34% | 16.82% | 13.35% | 0.275
KNeighborsClassifier | 22.76% | 14.66% | 15.84% | 13.62% | 0.205
LinearSVC | 29.56% | 14.95% | 17.98% | 14.33% | 0.276

Table 5.2: Results for models trained with features constructed from the categorical attributes of defect reports from Product 2

Tables 5.3 and 5.4 show the performance of the classification models trained with features constructed from the textual attributes of defect reports from Product 1 and Product 2 respectively. For Product 1 the maximum accuracy of 57.36% is achieved by using LinearSVC. For Product 2 the maximum accuracy of 37.17% is achieved by using LinearSVC.

Classifier | Accuracy | Precision (Macro) | Recall (Macro) | F1-score (Macro) | MCC
MultinomialNB | 47.17% | 15.96% | 13.35% | 12.98% | 0.42
DecisionTreeClassifier | 39.67% | 17.75% | 16.86% | 16.72% | 0.347
LogisticRegression | 55.36% | 25.04% | 18.56% | 19.05% | 0.512
KNeighborsClassifier | 38.5% | 17.99% | 15.65% | 15.57% | 0.331
LinearSVC | 57.36% | 31.61% | 24.72% | 26.02% | 0.536

Table 5.3: Results for models trained with features constructed from the textual attributes of defect reports from Product 1

Classifier | Accuracy | Precision (Macro) | Recall (Macro) | F1-score (Macro) | MCC
MultinomialNB | 27.12% | 11.47% | 8.31% | 7.33% | 0.245
DecisionTreeClassifier | 24.67% | 13.97% | 13.44% | 13.11% | 0.226
LogisticRegression | 35.12% | 17.44% | 13.0% | 12.96% | 0.328
KNeighborsClassifier | 20.77% | 15.75% | 11.32% | 11.22% | 0.188
LinearSVC | 37.17% | 23.41% | 19.88% | 20.07% | 0.352

Table 5.4: Results for models trained with features constructed from the textual attributes of defect reports from Product 2

Tables 5.5 and 5.6 show the performance of the classification models trained with features constructed from both the categorical and textual attributes of defect reports from Product 1 and Product 2 respectively.
For Product 1 the maximum accuracy of 58.52% is achieved by using LinearSVC. For Product 2 the maximum accuracy of 48.64% is achieved by using LinearSVC.

Classifier | Accuracy | Precision (Macro) | Recall (Macro) | F1-score (Macro) | MCC
MultinomialNB | 48.92% | 18.81% | 15.83% | 15.55% | 0.441
DecisionTreeClassifier | 39.91% | 18.57% | 17.75% | 17.72% | 0.35
LogisticRegression | 56.93% | 28.65% | 21.12% | 21.79% | 0.529
KNeighborsClassifier | 34.44% | 16.16% | 14.72% | 14.32% | 0.286
LinearSVC | 58.52% | 33.93% | 27.89% | 28.91% | 0.549

Table 5.5: Results for models trained with features constructed from both the textual and categorical attributes of defect reports from Product 1

Classifier | Accuracy | Precision (Macro) | Recall (Macro) | F1-score (Macro) | MCC
MultinomialNB | 38.35% | 22.55% | 18.42% | 16.8% | 0.366
DecisionTreeClassifier | 32.66% | 23.59% | 23.29% | 22.61% | 0.308
LogisticRegression | 47.75% | 34.1% | 29.26% | 28.68% | 0.462
KNeighborsClassifier | 34.25% | 25.11% | 25.01% | 23.17% | 0.324
LinearSVC | 48.64% | 38.06% | 35.76% | 35.14% | 0.472

Table 5.6: Results for models trained with features constructed from both the textual and categorical attributes of defect reports from Product 2

5.2 Iteration 2

This section presents the results of the second iteration. The second iteration aimed to evaluate which set of features and classification algorithm provided the best performance when constructing a classification model for providing the reporter with recommendations of the asset from which a defect originates. For each feature set the classification model with the maximum Recall@5 is bolded. The metrics Recall@1, Recall@3 and Recall@10 are also provided to show the relationship between Recall and the number of recommendations. Furthermore, the results of the feature set which provided the best performing classifier for each product are complemented with a chart which shows the relationship between Recall and Precision for different numbers of recommendations.
Tables 5.7 and 5.8 show the performance of the classification models trained with features constructed from the categorical attributes of defect reports from Product 1 and Product 2 respectively. For Product 1 the maximum Recall@5 of 62.72% is achieved by using LogisticRegression. For Product 2 the maximum Recall@5 of 62.35% is achieved by using LinearSVC.

Classifier | Recall@1 | Recall@3 | Recall@5 | Recall@10
MultinomialNB | 26.29% | 48.6% | 62.54% | 81.39%
DecisionTreeClassifier | 25.94% | 45.98% | 59.37% | 75.89%
LogisticRegression | 26.83% | 48.6% | 62.72% | 81.66%
KNeighborsClassifier | 21.68% | 37.11% | 44.42% | 48.56%
LinearSVC | 26.36% | 48.54% | 62.65% | 81.72%

Table 5.7: Results for recommendation models trained with features constructed from categorical attributes of defect reports from Product 1

Classifier | Recall@1 | Recall@3 | Recall@5 | Recall@10
MultinomialNB | 28.74% | 48.12% | 60.58% | 79.58%
DecisionTreeClassifier | 29.32% | 48.58% | 59.3% | 75.86%
LogisticRegression | 29.49% | 49.91% | 61.92% | 80.95%
KNeighborsClassifier | 23.02% | 38.11% | 45.03% | 47.69%
LinearSVC | 29.39% | 49.43% | 62.35% | 81.55%

Table 5.8: Results for recommendation models trained with features constructed from categorical attributes of defect reports from Product 2

Tables 5.9 and 5.10 show the performance of the classification models trained with features constructed from the textual attributes of defect reports from Product 1 and Product 2 respectively. For Product 1 the maximum Recall@5 of 86.74% is achieved by using LinearSVC. For Product 2 the maximum Recall@5 of 70.06% is achieved by using LinearSVC.
Classifier | Recall@1 | Recall@3 | Recall@5 | Recall@10
MultinomialNB | 47.16% | 70.08% | 79.28% | 89.23%
DecisionTreeClassifier | 39.86% | 40.59% | 41.08% | 52.43%
LogisticRegression | 55.35% | 76.89% | 84.33% | 92.13%
KNeighborsClassifier | 38.49% | 57.08% | 64.67% | 68.94%
LinearSVC | 56.35% | 78.73% | 86.74% | 94.03%

Table 5.9: Results for recommendation models trained with features constructed from textual attributes of defect reports from Product 1

Classifier | Recall@1 | Recall@3 | Recall@5 | Recall@10
MultinomialNB | 27.12% | 47.51% | 58.01% | 73.04%
DecisionTreeClassifier | 24.83% | 27.16% | 28.43% | 31.38%
LogisticRegression | 35.11% | 56.53% | 66.43% | 79.67%
KNeighborsClassifier | 20.76% | 35.31% | 43.29% | 45.98%
LinearSVC | 37.02% | 60.19% | 70.06% | 82.9%

Table 5.10: Results for recommendation models trained with features constructed from textual attributes of defect reports from Product 2

Tables 5.11 and 5.12 show the performance of the classification models trained with features constructed from both the categorical and textual attributes of defect reports from Product 1 and Product 2 respectively. For Product 1 the maximum Recall@5 of 86.59% is achieved by using LinearSVC. For Product 2 the maximum Recall@5 of 81.9% is achieved by using LinearSVC.
Classifier | Recall@1 | Recall@3 | Recall@5 | Recall@10
MultinomialNB | 48.91% | 70.44% | 79.59% | 89.06%
DecisionTreeClassifier | 40.17% | 40.87% | 41.32% | 52.31%
LogisticRegression | 56.92% | 77.75% | 85.09% | 92.63%
KNeighborsClassifier | 34.43% | 51.7% | 59.86% | 63.59%
LinearSVC | 57.91% | 79.08% | 86.59% | 94.1%

Table 5.11: Results for recommendation models trained with features constructed from both the textual and categorical attributes of defect reports from Product 1

Classifier | Recall@1 | Recall@3 | Recall@5 | Recall@10
MultinomialNB | 38.36% | 59.19% | 70.21% | 84.65%
DecisionTreeClassifier | 32.67% | 34.95% | 35.99% | 38.45%
LogisticRegression | 47.76% | 70.54% | 80.26% | 91.15%
KNeighborsClassifier | 34.27% | 52.71% | 60.04% | 61.84%
LinearSVC | 48.8% | 71.75% | 81.9% | 92.34%

Table 5.12: Results for recommendation models trained with features constructed from both the textual and categorical attributes of defect reports from Product 2

Figures 5.1 and 5.2 show the relationship between Recall and Precision for different numbers of recommendations of the classification models trained with features constructed from both the categorical and textual attributes of defect reports from Product 1 and Product 2 respectively. Furthermore, the figures include the values of the baseline model as a point of reference.

Figure 5.1: Resulting (a) recall@n and (b) precision@n for the recommendation model trained with features constructed from both textual and categorical attributes of defect reports from Product 1

Figure 5.2: Resulting (a) recall@n and (b) precision@n for the recommendation model trained with features constructed from both textual and categorical attributes of defect reports from Product 2

6 Discussion

In this chapter the results of each iteration are discussed. Furthermore, possible future work and the threats to validity are discussed.
6.1 Iteration 1

The first iteration evaluated which set of features and classification algorithm provided the best performance when constructing a classification model for classifying the asset from which a defect originates. All classification models that were trained during the first iteration achieved higher scores on all measures than the benchmark values seen in Tables 6.1 and 6.2.

Setup | Accuracy | Precision (Macro) | Recall (Macro) | F1-score (Macro) | MCC
stratified | 8.01% | 2.2% | 2.44% | 2.13% | -0.001
most_frequent | 13.1% | 0.3% | 2.33% | 0.54% | 0.0
uniform | 2.37% | 2.26% | 2.54% | 1.65% | -0.001

Table 6.1: The benchmark values by setup for Product 1.

Setup | Accuracy | Precision (Macro) | Recall (Macro) | F1-score (Macro) | MCC
stratified | 2.85% | 0.8% | 0.81% | 0.78% | -0.002
most_frequent | 9.4% | 0.08% | 0.85% | 0.15% | 0.0
uniform | 0.79% | 0.83% | 0.93% | 0.63% | -0.001

Table 6.2: The benchmark values by setup for Product 2.

Across all feature sets used to train classification models, the models using LogisticRegression and LinearSVC provided the highest accuracies. However, the choice of algorithm did not make a significant difference on the F1-Score and MCC when only using features constructed from the categorical attributes. Comparing the measures of the best performing models using either features constructed from the textual attributes or the categorical attributes shows that the textual attributes provide more information for classifying the asset. The best performing classification model for each product was trained with features constructed from both the categorical and textual attributes using LinearSVC, as seen in Table 6.3.

Product | Classifier | Accuracy | Precision (Macro) | Recall (Macro) | F1-score (Macro) | MCC
Product 1 | LinearSVC | 58.52% | 33.93% | 27.89% | 28.91% | 0.549
Product 2 | LinearSVC | 48.64% | 38.06% | 35.76% | 35.14% | 0.472

Table 6.3: The best performing classification model for each product.
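The MCC values in these tables range from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction). A minimal sketch with hypothetical asset labels, using scikit-learn:

```python
# Minimal sketch (hypothetical labels): MCC for perfect predictions and
# for a constant prediction like the most_frequent baseline.
from sklearn.metrics import matthews_corrcoef

y_true = ["Login", "Decoder", "Router", "Login", "Decoder", "Router"]
y_perfect = list(y_true)
y_constant = ["Login"] * 6  # always predicts the most frequent-style guess

print(matthews_corrcoef(y_true, y_perfect))   # 1.0
print(matthews_corrcoef(y_true, y_constant))  # 0.0
```

This is why the most_frequent benchmark models score an MCC of 0.0 despite their non-zero accuracy.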
The best performing classification model developed with the defect reports of Product 1 achieves an accuracy of 58.52%. However, the model only achieves a macro-averaged precision of 33.93% and a macro-averaged recall of 27.89%. This shows that, when biasing the measures towards the least populated classes, the probability of the classifier correctly labeling an unseen defect report is 33.93% and the probability of correctly labeling an instance of a given class is 27.89%. The classifier achieves a MCC of 0.549, which indicates a positive correlation between the predictions and the correct labels. This implies that the classifier is making significant predictions that are not determined randomly. Similar observations can be made from the results of the best performing classification model developed with the defect reports of Product 2. However, that model has a lower accuracy of 48.64% but a higher precision and recall of 38.06% and 35.76% respectively.

6.2 Iteration 2

The second iteration evaluated which set of features and classification algorithm provided the best performance when constructing a classification model for providing the reporter with recommendations of the asset from which a defect originates. For both products, the feature set that provided the best classification model was constructed from both the categorical and textual attributes of the defect reports. The benchmark values for evaluating the developed classification models can be seen in Table 6.4. Comparing the classification models using features constructed from both the categorical and textual attributes shows that the only models performing better than the benchmarks, for both products, were those using LinearSVC, LogisticRegression or MultinomialNB.

Product | Recall@1 | Recall@3 | Recall@5 | Recall@10
Product 1 | 13.1% | 36.75% | 52.15% | 78.71%
Product 2 | 9.4% | 20.16% | 26.66% | 40.16%

Table 6.4: The benchmark values of the recommendation system for each product.
The best performing classification model for providing the reporter with a list of recommended assets, for both products, was constructed using LinearSVC. The Recall@n and Precision@n measures of this model can be seen in Table 6.5 and Table 6.6 respectively. For both products, the Recall@n increases while the Precision@n decreases as the number of recommendations increases.

Product | Classifier | Recall@1 | Recall@3 | Recall@5 | Recall@10
Product 1 | LinearSVC | 57.91% | 79.08% | 86.59% | 94.1%
Product 2 | LinearSVC | 48.8% | 71.75% | 81.9% | 92.34%

Table 6.5: The Recall@n of the best performing classifier of each product.

Product | Classifier | Precision@1 | Precision@3 | Precision@5 | Precision@10
Product 1 | LinearSVC | 57.91% | 26.36% | 17.32% | 9.41%
Product 2 | LinearSVC | 48.8% | 23.92% | 16.38% | 9.23%

Table 6.6: The Precision@n of the best performing classifier of each product.

To compare the performance of the developed recommendation models with the currently provided list of assets, the Recall@n and Precision@n are aggregated into the F1-Score, which is the harmonic mean of the two measures. Since the balance between recall and precision is something that needs to be studied, the recall and precision are weighted equally. The F1-Scores for the classification models can be seen in Table 6.7. Furthermore, the Recall@n, Precision@n and F1@n measures for the currently provided lists of assets can be seen in Table 6.8.

Product | Classifier | F1@1 | F1@3 | F1@5 | F1@10
Product 1 | LinearSVC | 57.91% | 39.54% | 28.87% | 17.11%
Product 2 | LinearSVC | 48.8% | 35.88% | 27.3% | 16.78%

Table 6.7: The F1@n of the best performing classifier of each product.

Comparing the F1-Score for the currently provided lists of assets and the best performing classification models shows that the recommendation systems perform better regardless of the number of recommendations when weighting Recall@n and Precision@n equally.
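The F1@n aggregation can be sketched as follows. Since precision@n = recall@n / n for this data, F1@n reduces algebraically to 2 · recall@n / (n + 1); the example below plugs in Recall@n values from Table 6.5 for Product 1:

```python
# F1@n as the harmonic mean of recall@n and precision@n. Because each
# defect report has exactly one correct asset, precision@n = recall@n / n,
# so F1@n simplifies to 2 * recall@n / (n + 1).
def f1_at_n(recall_at_n, n):
    precision_at_n = recall_at_n / n
    return 2 * recall_at_n * precision_at_n / (recall_at_n + precision_at_n)

# Product 1, LinearSVC (Recall@3 = 79.08%, Recall@10 = 94.1%):
print(round(f1_at_n(0.7908, 3), 4))   # 0.3954, matching F1@3 in Table 6.7
print(round(f1_at_n(0.9410, 10), 4))  # 0.1711, matching F1@10 in Table 6.7
```

Small discrepancies against the tabulated values can arise because the Recall@n values themselves are rounded to two decimals.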
Product | n | Recall@n | Precision@n | F1@n
Product 1 | > 40 | 100% | < 2.5% | < 4.88%
Product 2 | > 100 | 100% | < 1% | < 1.98%

Table 6.8: The Recall@n, Precision@n and F1@n measures of the currently provided lists of assets for each product.

For instance, by providing a list of 10 recommendations, the Recall decreased by 5.9 percentage points for Product 1 and 7.66 percentage points for Product 2, while the Precision increased by more than 276% for Product 1 and 823% for Product 2.

6.3 Future Work

To further the research on classifying the asset from which a defect originates using machine learning, there are a few approaches that can be taken.

6.3.1 Feature Engineering

One possible approach is to focus on the features used for classification. In this thesis, two sets of features were constructed and evaluated. However, the degree to which the classification model learns from each feature can be evaluated. For instance, the severity attribute might not correlate with the asset from which the defect originates and could therefore be excluded. This would reduce the number of features, which would reduce the classification and training time and could also increase the quality of the classifications. Other than reducing the number of studied features, the existing features can be tuned. For instance, the number of words selected from the feature vector constructed by the TF-IDF algorithm was 1,000. The number of selected words can be tuned by performing cross-validation on the training set with different values of the parameter. An increase in the number of selected words might increase the performance of the classification model, since significant words that distinguish each defect report might have been excluded when only selecting 1,000 words.

6.3.2 Tuni