Implementation and Evaluation of Auto-
matic Prioritization for Continuous Inte-
gration Test Cases
Master’s thesis in Software Engineering

Jiayu Hu

Yiqun Li

Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY | UNIVERSITY OF GOTHENBURG
Gothenburg, Sweden 2016


Master’s thesis 2016:06

Implementation and Evaluation of Automatic
Prioritization for Continuous Integration Test

Cases

JIAYU HU

YIQUN LI

Department of Computer Science and Engineering
Chalmers University of Technology | University of Gothenburg

Gothenburg, Sweden 2016


Implementation and Evaluation of Automatic Prioritization for Continuous Integra-
tion Test Cases
JIAYU HU
YIQUN LI

© JIAYU HU, 2016.
© YIQUN LI, 2016.

Academic Supervisor: Eric Knauss, Computer Science and Engineering
Industrial Supervisor: Gustaf Johansson, Ericsson AB
Examiner: Regina Hebig, Computer Science and Engineering

Master’s Thesis 2016:06
Department of Computer Science and Engineering
Chalmers University of Technology | University of Gothenburg
SE-412 96 Gothenburg | SE-405 30 Gothenburg
Telephone +46 31 772 1000 | +46 31 786 0000

iv


Implementation and Evaluation of Automatic Prioritization for Continuous Integra-
tion Test Cases
JIAYU HU
YIQUN LI
Department of Computer Science and Engineering
Chalmers University of Technology | University of Gothenburg

Abstract
Nowadays, continuous integration is widely accepted and implemented by most in-
dustrial companies due to the benefits of adding new functions and accelerating the
delivery of new products. But at the same time, frequent integration brings chal-
lenges, especially to large software companies. One challenge in the case company is
to prioritize the test cases that are most likely to fail first in order to minimize the
feedback loop from change to test result. In order to guarantee the product quality,
a significant number of tests shall be executed after every integration, which has
high demands on resources and thus has are associated with a high cost and time
consumption. This thesis proposes a method based on mining correlations between
historical data of test results and modified files. The Matthews Correlation Coeffi-
cient (MCC) and a scoring curves refers to Average Percentage of Faults Detected
(APFD) are used to evaluate results and determine correlations. We aim to create
an automatic way to provide a list of test cases that prioritized by their probability
to fail, and further help the case company to shorten their feedback loop and time
to market.

Keywords: continuous integration, prioritization, test cases, changed files, data mim-
ing, failure probability

v


Acknowledgements
We wish to thank our academic supervisor Eric Knauss for his valuable suggestions
and enthusiastic feedback. We would also like to thank Ericsson, especially to our
industry supervisor Gustaf Johansson, for giving us the opportunity to conduct this
study and generously provided historical data sets. Special thanks to our examiner
Regina Hebig for constructive feedback and active cooperation.

Jiayu Hu & Yiqun Li, Gothenburg, June 2016

vii


Contents

1 Introduction 1

2 Background and Related Work 2
2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2.1.1 Case Company . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2.1.2 Continuous Integration and Testing . . . . . . . . . . . . . . . 2
2.1.3 Data Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2.1 Mining Software Repositories (MSR) . . . . . . . . . . . . . . 3
2.2.2 Test Case Prioritization . . . . . . . . . . . . . . . . . . . . . 4
2.2.3 Average Percentage of Faults Detected (APFD) . . . . . . . . 5

3 Research Methods 6
3.1 Research Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.2 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.3 Research Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . 7

3.3.1 Interviews . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.3.2 Literature Reviews . . . . . . . . . . . . . . . . . . . . . . . . 8

3.4 Data Obtaining and Processing . . . . . . . . . . . . . . . . . . . . . 8
3.4.1 Data Mining Tool . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.4.2 Data Structure . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.4.3 Data Transformation . . . . . . . . . . . . . . . . . . . . . . . 10

3.5 Measurements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.5.1 Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.5.2 Measurement Definitions . . . . . . . . . . . . . . . . . . . . . 12

4 Data Analysis 14
4.1 Familiarizing with Initial Data Sets . . . . . . . . . . . . . . . . . . . 14
4.2 Processing the Data Set with Small Size . . . . . . . . . . . . . . . . 16
4.3 Processing the Data Sets with Large Size . . . . . . . . . . . . . . . . 20

5 Results 26

6 Discussion 31
6.1 Bottlenecks and Challenges . . . . . . . . . . . . . . . . . . . . . . . 31
6.2 Trade-offs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

6.2.1 Selection of Algorithms . . . . . . . . . . . . . . . . . . . . . . 32

ix


Contents

6.2.2 Filtering Data Sets . . . . . . . . . . . . . . . . . . . . . . . . 32
6.2.3 Scoring System . . . . . . . . . . . . . . . . . . . . . . . . . . 33

6.3 Feasibility, Applicability and Practicality . . . . . . . . . . . . . . . . 33
6.4 Restrictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
6.5 Threats to Validity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

7 Conclusion 36
7.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

Reference 38

x


1
Introduction

In recent years, software companies are putting more and more efforts towards con-
tinuous integration in order to deliver high quality software to their customers as
fast as possible. Continuous integration (CI), as one of the practices of agile de-
velopment, demands that members of the development team integrate their work
frequently [1] [2]. This practice supports the advocation of agile development: ef-
fectively respond to change [3].

However, when companies try to adopt CI, many of them are struck by the time
limit. For instance, companies find that the feedback loop from regression tests are
too long [4]. To be more specific, if the developers integrate their work every day,
while not receiving feedback response until two days later. In this case, the whole
developing process would be slowed down. For increasing the efficiency, it is of great
significance for developers to find a method that solves above mentioned issue while
not decreasing the quality of software testing.

One of the most effective ways is prioritizing critical test suits [30]. The priori-
tization helps developers to find out the most critical test cases from test suits,
which dramatically optimizes the feedback time from the selected test suites. The
study of Felderer had also confirmed the feasibility of this approach. He did several
case studies to prove that the process could be used in real world for risk-based
testing and used a cache to monitor fault-prone files and recommends test cases to
rerun to cover updated files [5] [6].

In this study, we aim to create and evaluate a methodology that automatically
prioritizes test cases based on mining the historical data of test results and modified
files. One academic and one industrial supervisor supervised this thesis.

Section 2 provides the background of this study and a review of related litera-
ture; Section 3 describes the purpose and approach of this study; Section 4 shows
the study and analysis processes; Section 5 reports and explains results; Section 6
contains the discussion and Section 7 gives the final conclusion.

1


2
Background and Related Work

This section first provides the background of this study, and gives reviews of related
literature to test case prioritization.

2.1 Background
The background of this study is firstly introduced in this subsection.

2.1.1 Case Company
In this study, the case company is Ericsson. It is a world leader in the rapidly chang-
ing environment of communications technology-providing equipment, software and
services to enable transformation through mobility. Some 40 percent of global mobile
traffic runs through networks they have supplied. More than one billion subscribers
around the world rely every day on networks that they manage. With more than
39,000 granted patents, they have one of the industry’s strongest intellectual prop-
erty rights portfolios. Currently, the company is putting efforts on CI and trying
to improve their processes. And we are working with one of their testing teams,
gathering and analyzing the data of actual industrial test results.

2.1.2 Continuous Integration and Testing
Generally, Continuous Integration (CI) is a set of principles that apply to the daily
workflow of development teams, which recommends that all code must be kept in
a repository. In its life cycle, when a developer commits his or her changes to
the repository, system would pick up changes and verify them. At the same time,
change history is recorded for version control [1] [19]. Such frequent integration
requires people to increase test coverage for each new change, while also keep the
tests fast.

Testing aids and simplifies the detection of software failures and further discov-
ers and corrects the defects. Testing does not work to guarantee the functions of a
product under all conditions but finds the problems under specific conditions [21].
Functional testing is widely used in industrial development, which is a quality as-
surance (QA) process and usually describes what the system does [22]. In black

2


2. Background and Related Work

box testing, functions are tested by putting them as the input and examining the
output, without considering internal program structure [23]. In the case company,
they have test suits that can simulate different situations and give accurate feedback.
Regression testing is typically the largest test effort in industrial software includes
functional testing [24]. It checks numerous details from prior product features, and
tests new designs of new software to ensure that prior functionality are still sup-
ported.

2.1.3 Data Mining
Data Mining, also popularly known as Knowledge Discovery in Databases (KDD),
refers to the nontrivial extraction of implicit, previously unknown and potentially
useful information from data in databases [8]. It is an automatic or more usually
semiautomatic process [9]. There are four basic different styles of learning that ap-
pear in data mining applications. In classification learning, the learning scheme is
presented with a set of classified examples from which it is expected to learn a way of
classifying unseen examples. In association learning, any association among features
is sought, not just ones that predict a particular class value. In clustering, groups
of examples that belong together are sought. In numeric prediction, the outcome to
be predicted is not a discrete class but a numeric quantity [10].

2.2 Related Work
Reviews of literature that related to test case prioritization are given in this subsec-
tion.

2.2.1 Mining Software Repositories (MSR)
In the continuous integration developments, a large amount of information about
the change history and the results of executed test cases can be produced and needs
to be stored. The case company Ericsson does have the repositories that stored
these version control data. By extracting from that data, we are able to obtain the
adequate materials that we need for our analysis.

The methodology of Mining Software Repositories (MSR) helps to analyze the big
data from software repositories [17]. And Ahmed [18] also demonstrated that it is
valuable to mining software repositories for assisting managers and developers in
performing a variety of software development, maintenance, and management ac-
tivities. The MSR processes contain two main phases, which start with preparing
the data and followed by the analysis phase. We applied both two phases in this
thesis work, and put the main focus on data analysis. The reason is that we did not
have direct access to Ericsson’s databases, instead, we asked the help of industrial
supervisor to extract related historical data from repositories and processed some
transformation which exclude noises before we performed analysis.

3


2. Background and Related Work

The original data sets provided were filtered CSV files, and we processed a further
transformation by using our own scripts which transforms CSV files into ARFF files
and make them usable in Weka for later analysis.

2.2.2 Test Case Prioritization

Saha et al [20] came up with a method that prioritizes test cases, so the higher pri-
ority test cases would have a higher likelihood of locating bugs. They combined the
statistical methods with the scripts written in programming languages for address-
ing the problem of regression test prioritization. Their approaches bring us good
inspiration and starting point.

At the same time, this study is based on the risk of test cases execution history, and
we focused on prioritizing test cases. According to Felderer and Schieferdecker’s [11]
taxonomy of risk-based testing, our study is one of the risk-based testing specifics
in test processes.

Wang [7] did a similar study in another case company and analyzed data sets which
had relatively small sizes. In her study, she aimed at defining out fitness functions
based on source code changes. She looked into changes of software artifacts and
correlating those with test failures, and then took into account not only test quality
but also hidden technical dependencies in the software. In this study, we extend her
method and prioritize test cases based on changed files. The two algorithms that
Wang applied are NaiveBayes and RandomForest, where the latter is a notion of
the general technique of random decision forests [12] [13] and an ensemble learning
method that can be applied to process classification, regression and other tasks.

The RandomForest is operated by constructing a multitude of decision trees at
training time and outputting the class that is the mode of the classes (classifica-
tion) or mean prediction (regression) of the individual trees [14]. This algorithm
can provide many benefits. It can handle a large number of input variables, while
producing highly accurate classifier for a variety of materials. And for incomplete
or unbalanced classification data sets, it can balance errors and generate precise
predictions. Both from the actual testing results and several useful literatures [25]
[26] [27], we found out that the RandomForest is arguably be the most appropriate
and efficient algorithm in this study. Therefore, even if it has a few shortcomings,
such as high hardware requirements and time consumption, we decided to focus on
using RandomForest for our analysis and discuss trade-offs with the company.

Based on the above, we would analyze received historical data by using RandomFor-
est algorithm, and then prioritize test cases based on our analysis results. Generally,
if the change of a file made some test cases to fail recurrently, those test cases would
have higher priorities for executing when the same file was changed.

4


2. Background and Related Work

2.2.3 Average Percentage of Faults Detected (APFD)
Gregg et al [28] presented the methodology of Average Percentage of Faults Detected
in their report. It is a metric designed for measuring the rate of fault detection for
test suite execution. They described several techniques for using test execution in-
formation to prioritize test cases for regression testing, and reported their outcomes
of different experiments. Wherein its calculation is a great reference for our scoring
system in the late stage of study.

5


3
Research Methods

This section focuses on describing the research purpose, questions and methodolo-
gies of this study. Several contents of data analysis are also included to introduce
scientific research methods and processes.

3.1 Research Purpose
The main purpose of this thesis is to study data mining methodology and analyze
data sets of historical testing results from the case company. We aim to create an
appropriate method that automatically prioritizes the test cases. In the later stages,
the outcomes are going to be tested and evaluated by real industry data to guaran-
tee their feasibility, applicability and practicality. For the case company, the final
result shall be able to help shorten the time-to-market while also guaranteeing the
software quality.

3.2 Research Questions
In order to achieve the above study purpose, we set up three research questions
(RQ) that help us to complete this study step by step:

RQ1: To what extent is the available data at the case company suitable for predic-
tion of test case failure?

With RQ1, we check if the data is suitable for automatic classification and whether
we can compute the probability of failure for individual tests based on MCC mea-
sure (see Section 3.3).

RQ2: To what extent can APCIT be used to prioritize the test cases, are there
any existing restrictions?

With RQ2, we check to what extent the failure probabilities of the individual tests
can be used to prioritize functional test cases at the case company. For this, we cre-
ate baselines and use ranking referring APFD metric to evaluate the prioritization.

6


3. Research Methods

RQ3: To what extent can APCIT bring benefits to the case company?

With RQ3, we enter a critical discussion on how the results for RQ1-2 can pro-
vide value to the case company, based on semi-structured interviews with related
development groups in the case company.

3.3 Research Methodology
In Continuous Integration and automatic integration tests, the changes on one file
or similar files could cause the same error and make the same test case fail. The
repository of the case company can record file updates of each integration, as well as
test results of different testing jobs into the historical data. By implementing data
mining on the historical data, we aim to find out the correlation of changed files and
test cases.

In order to achieve the research purpose, we combined the statistical methods with
data mining knowledge to guide our study. We established and verified the results of
prioritizing test cases based on historical data that contains file changes and results
of automatic integration tests. For the final results, the method shall take a set
of test results and a set of file changes as input. The method shall automatically
prioritize test cases and give the highest prioritization to those tests that are the
most likely to fail given one set of changed files.

3.3.1 Interviews
This study process started from shallow to deep, gradually progressing. In order to
understand how automatic prioritization of functional integration tests could help
the case company, we conducted semi-structured interviews with the case company
and aimed to probe what is the use case of prioritization. To be more specific, we
were trying to understand why the company needs prioritization and what kinds of
results were desired. Interviews were handled during the weekly supervision meet-
ings where both supervisors participated. And up to two more people from the same
development department also joined several meetings. The presence of them helped
to guide us in obtaining sufficient as well as accurate information that we needed
for our thesis work.

Since those interviews are semi-structured, specific questions and contents were not
written down nor recorded, instead, we had discussions and made notes. By com-
paring and summarizing notes of two researchers after every meeting, we made a
maximum effort to ensure that there was no misunderstanding or loss of information.

After having direct dialogues with industry, we learned that all test cases shall
be run eventually regardless of the prioritization results in the case company. For
not running all test cases every iteration, our method is supposed to help to find

7


3. Research Methods

which test cases are most likely to fail for each job. By using our solution, developers
would be able to gain a maximum guarantee of quality by selecting test cases from
the sorted test cases when they were facing time-critical deliveries or needing fast
feedback.

3.3.2 Literature Reviews
After the in-depth understanding of use cases, we started to study helpful and valu-
able literature that we found. Due to this study is experimental and focusing on
actual operation, we did not intend to do systematic literature reviews.

The purpose of reading literature is to gain domain knowledge, as well as com-
paring different data mining methods and algorithms. The selection of papers was
focused on literature that published in recent years and new reports given in related
conferences.

3.4 Data Obtaining and Processing
After correct understanding of the use cases and learned necessary domain knowl-
edge, we began to analyze data sets provided by the case company. The processing
started with data sets that with rather small size, this helps us to understand data
structure and then created our own data processing scripts. At the same time, this
strategy also provides convenience for testing the scripts. In this study, we obtained
and processed four kinds of data sets:

Table 3.1: Four kinds of data sets.

Name of data set(s) Description
Initial data sets Two sets, one lists details of test results based

on each job and another based on test cases.
Small data set 16_90pct Combined job-based and test-case-based

historical data, contains 168 jobs and 299 test
cases. These test results are from 2016, and
only jobs with over 90% pass rate are included.
It is a sub-set of large date set.

Large data set 14_90pct Combined job-based and test-case-based
historical data, contains 5540 jobs and 683 test
cases. These test results are from 2014, and
only jobs with over 90% pass rate are included.

Large data set 14_all Combined job-based and test-case-based
historical data, contains 5708 jobs and 983 test
cases. These test results are from 2014, all
jobs are included.

8


3. Research Methods

Processed results were discussed immediately and directly with the case company,
we determine whether the final outputs meet the actual demands. After the case
company recognized processed results of small data size, we also analyzed larger sets.
In the end, we confirmed feasibility, applicability and practicality of our method by
implementing and testing it to the data sets with different sizes and features.

3.4.1 Data Mining Tool
In this study, we used the Waikato Environment for Knowledge Analysis (Weka) as
our analysis engine. Weka is a public data mining workbench that assembles a large
number of machine learning algorithms that can take on the task of data mining.
Weka also includes data preprocessing, classification, regression, clustering, associ-
ation rules and a visualized interactive interface [31]. It is worth mentioning that
Weka contains lots of useful algorithms including RandomForest analysis, which is
beneficial to our case.

Like most of other spreadsheet or data analysis software, Weka can only deal with
the data collections that are in two-dimensional forms. Weka format for analyzing
and storing data is in ARFF file, which is an ASCII text file and is going to be
introduced in Section 3.4.3.

3.4.2 Data Structure
The original data sets are CSV files that include timestamp, version, parent, build
job, scope, executed test cases and corresponding test results.

The jobs are test events, where the timestamp indicates the operation time for
each job. Several jobs could be run at the same timestamp; the scopes are applied
test suits. Different scopes could be used at the same time overlap, and the same
test cases might have different names. In our study, test cases with other names
were treated as separate objects; the columns start with string of characters show
test results specific cases, while the columns start with string “file” or “test” indicate
the changed files.

9


3. Research Methods

An example of the data sets we received is shown in Figure 3.1.

Figure 3.1: Original CSV file.

In order to analyze the relevance of files and test cases, we created a java script that
removes redundant data, unifies test results to three states (PASS, FAIL_or_ERROR,
NONE) and transforms the original files so that the data mining tool can analyse
them.

3.4.3 Data Transformation
The original data sets were transformed, which were in CSV format, into ARFF
files. The Attribute-Relation File Format (ARFF) is ASCII text files, it includes
the relationship declaration and the property declaration of attributes. The rela-
tionship declarations are defined from the first active line of ARFF file with the
format: @relation <relation-name>, where the <Relation-name> is a string. If the
string contains spaces, it must be quoted; the property declarations are defined af-
ter all relationships are declared and are represented by the declarations that start
with “@attribute” statement. These statements define the names and types for at-
tributes. As shown in Figure 3.2, the orders of the declaration statements are very
important, because they define positions of attributes in the data sets. The recog-
nition of ARFF files are branches, and thus arbitrary line breaks are not allowed.
Blank lines (or all blank lines) and the lines start with "%" (as comments) will be
ignored by Weka.

10


3. Research Methods

Figure 3.2 shows one of the ARFF files generated from the original data.

Figure 3.2: Transformed ARFF file.

3.5 Measurements
Models can be built by the data mining tool from the transformed files. In order to
measure the quality of these models, the confusion matrix and some measurements
need to be introduced.

3.5.1 Confusion Matrix
The confusion matrix contains the actual and the predicted classifications, and fur-
ther help to judge right and wrong. Since we have three states in our data, there are
three class classifiers in our confusion matrix (Figure 3.3). The columns represent
the instances in a predicted class, and the rows represent the instances in an actual
class.

Figure 3.3: The Prioritization Results Example.

In this confusion matrix, the letter “a”, “b” and “c” represent “PASS”, “FAIL/ER-
ROR” and “NONE” respectively. In more detail, "aa", "bb" and "cc" are the true

11


3. Research Methods

positive (TP) values; the false negative (FN) values are the sum of corresponding
rows but exclude TP, for example, "ab + ac" is the FN of "a"; the true negative (TN)
value of a class can be calculated by summing all columns and rows but exclude its
own’, in our case, "bb + bc + cb + cc" is the TN of "a"; and the false positive (FP)
values are the sum of corresponding column but exclude TP, for instance, "ba + ca"
is the FP of "a".

3.5.2 Measurement Definitions
We used four retrieval measures in our study, which are recall, precision, f-measure
and Matthews correlation coefficient (MCC). However, the first three are just reflec-
tion of statistical characteristics of data, we decided to focus on MCC. The results
of these measures are all based on four categories:

Table 3.2: The four categories.

True Positives (TP ) The test cases that are predicted to fail and
actually failed according to the ground truth;
it is the true correct rate.

False Positives (FP ) The test cases that are predicted to fail but
actually not failed according to the ground
truth; it is the false alarm rate.

True Negatives (TN) The test cases that are not predicted to fail and
actually not failed according to ground truth;
it is the false correct rate.

False Negatives (FN) The test cases that are not predicted to fail and
actually failed according to the ground truth;
it is the false negative rate.

MCC is a measurement that is used in machine learning as a measure of the quality
of binary (two-class) classifications [15]. Since it includes true and false positives and
negatives, it is actually calculating the correlation coefficients between the observed
and predicted classifications. Thus it is a balanced measure that is used even if the
classes are of very different sizes. In our case, we have several data sets with very
different sizes, and thus MCC can give the most reliable guarantee to the correctness
of our final results. After considering, we decided to focus on using MCC to verify
our final results. The MCC can be calculated automatically by Weka or with the
following equation:

12


3. Research Methods

The return value of MCC is between -1 and +1. Rumsey [16] determines the mean-
ing of these coefficients as following:

Table 3.3: The meaning of different MCC coefficients.

MCC value Description
+1.0 Indicates a perfect prediction, which tests that are

recommended by the recommender system and failed
according to the ground truth.

+0.70 + 0.99 Indicates a very strong positive relationship.
+0.40 + 0.69 Indicates strong positive relationship.
+0.20 + 0.39 Indicates moderate positive relationship
+0.19 − 0.19 Indicates no or negligible relationship.
−0.20 − 0.29 Indicates weak negative relationship.
−0.30 − 0.39 Indicates moderate negative relationship.
−0.40 − 0.69 Indicates strong negative relationship.
−0.70 − 0.99 Indicates very strong negative relationship.
−1.0 Indicates total disagreement between prediction and

observation.

Generally, the higher the MCC values, the better the prediction. If a value is be-
tween -0.19 to +0.19, it means the result is random guessing and not reliable nor
useful. Those MCC values reflect the quality of classification by using classifier.

13


4
Data Analysis

This section describes the details of data processing on different sets. The analysis
processes of data were implemented step by step. In order to have a comprehen-
sive analysis of the characteristics of data, we started with familiarizing us with the
initial data sets (Section 4.1), while also creating our own analyzing scripts. The
scripts were first tested with the data set with small sizes (Section 4.2) for refine-
ment and validation, and then were applied to the large data set (Section 4.3) for
obtaining the final results.

4.1 Familiarizing with Initial Data Sets
By discussing with the case company, we learned that the case company has started
applying CI in its development process and it is trying to improve related work-
flow. In order to develop software with high quality, a large number of functional
test cases are constantly added and modified during the feature development phase.
Thus they are faced with the problem that running all test cases creates long feed-
back loops, which directly increases the time-to-market and reduces the flexibility
when responding to frequently changed requirements.

In the case company, functional test cases are testing the system from a black box
perspective where people supplies the inputs and validates the outputs, details of
execution are hidden. These test cases simulate different environments and situa-
tions. Test cases are grouped into test suites and into different scopes.

As introduced in Section 3.4.2, the original data set contains historical test case
results, and it is in CSV format. In the beginning, the data we acquired were initial
data sets (introduced in Table 3.1). By parsing those data sets, we understood the
data structure and captured the basic characteristics of the test results.

14


4. Data Analysis

The Figure 4.1 visualizes and summaries the parsed job-based outcome in the initial
data set.

Figure 4.1: The percentage of failed test case distribution for each job in initial
data set.

This pie chart shows the percentage of failure for each job. The letter “P” in the
subscript represents percentage of failed test cases, and the numbers are percentages.
For instance, in this data set, 55% jobs inclue zero failed test cases. More than half of
jobs do not contain any failed test case, and the rest of jobs only have few failed tests.

Figure 4.2: The pass rate distribution based on test cases in initial data set.

Similar characteristics can be found in Figure 4.2, which is the parsed test-case-
based outcome in initial data set. Here, the x-axis shows the pass rate and the
y-axis describes the number of test cases. Similar to Figure 4.1, the letter “P” in
this subscript represents actual pass rate, and the numbers are percentages. The
histogram not only shows that most of test cases never or just rarely failed, but also

15


4. Data Analysis

indicates some test cases, 313 of them, actually failed very often. This is a finding
that cannot be ignored. After discussing with the case company, we learned that
one of the reasons could be those often-failed test cases themselves are problematic.
Also, in some situations, test cases failed because of the inappropriate testing envi-
ronments.

Due to the presence of above-mentioned situations, we recognize that those histori-
cal data sets are imbalanced and hard to analyze. Some algorithms cannot handle
imbalanced data sets, but some can. As argued in Section 2.2.2, the algorithm of
RandomForest is arguably be the best choice for our case. Currently, the present
inputs seem excessively extreme where a few test cases have over 95% failure rate
and most of test cases never fail.

4.2 Processing the Data Set with Small Size
After we had familiarized ourselves with the initial data sets, we received a new
data set from the case company that combined job-based and test-case-based data
- the small data set 16_90pct (introduced in Table 3.1). It helped us to analyze the
correlation between test case results and changed files.

For analyzing the correlation between test cases and the failure rate, we decided
to focus on studying two correspondences in the data: the number of test cases and
failure rate based on each job, and the number of executions and failure rate based
on each test case. By doing this, our objective was to understand better on how to
do prioritization and evaluation according to the information we got.

In this study, heat maps are used to visualize the statistical results. All maps
follow these principles:

a.Results are sorted according to the failure rate (including FAIL and
ERROR) in descending order, thus the upper jobs or test cases are hav-
ing higher failure rate.

b.The left column represents number of test cases executed in each job
or the number of executions for each test case.

c.The right column shows corresponding failure rates of objects in the
left column.

d.The colors in the rows indicate differences with the average value, the
redder the higher than the average, and the bluer the lower the average.

e.The blue solid lines in the middle of both columns also represent dif-
ferences with the average value, when it skewed to right means higher,
while to left means lower.

16


4. Data Analysis

f.The dotted lines in the middle of both columns represent average val-
ues.

The Figure 4.3 is the heat map for small data set 16_90pct, which shows the corre-
lation between jobs and test cases. Several test cases were verified in same job, and
the failure rates of jobs are indicated by colors.

Job ID
1461990

Number of test cases Failure rate

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1468557

Avg ≈ 297.27

Max = 299

Min = 125

Avg ≈ 0.79%

Max ≈ 5.69%

Min = 0%

Figure 4.3: The number of test cases and failure rate based on each job in small
data set 16_90pct.

From the above heat map we can see that solid line is relatively stable in the left
column and the color does not change a lot, which means in this data set, there is
a very even distribution of tests run per job. In the source data, 92 out of 168 jobs

17


4. Data Analysis

do not contain failures, and it is shown in the picture that most of jobs have blue
and white colors in the column of failure rate, where these colors means low or none
differences with the average. Less than half of the jobs have high failure rates, which
are indicated by color red. The same features are shown by lines as well, where in
the column of failure rate, only part of solid line crosses the dotted line and goes
right, while most of solid line stay in the left. Since the average value is only 0.79%,
and minimum failure rate is 0%, we can say that most of jobs contains none or few
failed test cases.

Besides observing the characteristics of jobs, it is very helpful to study the fea-
tures of test cases. The Figure 4.4 shows the correspondence between number of
execution for different test cases and their related failure rates for the same data set.

Number of executions Failure rate

Test case ID

lte_12695647::tc_12139230 

ttx_1412353::tc_2568065

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Avg ≈ 167.03

Max = 168

Min = 162

Avg ≈ 0.8%

Max ≈ 10.18%

Min = 0%

Figure 4.4: The number of executions and failure rate based on each test case in
small data set 16_90pct.

As shown in above heat map, blue solid and dotted lines from the left column in-

18


4. Data Analysis

dicates that test cases were executed multiple times. In the right column, most of
these test cases have blue and white color for their failure rate, which means most
of them rarely or never failed. Since the most of solid line stays in the left of dotted
line, it means their failure rates are lower than average value, which is around 0.8%.
According to the actual statistical results, around 20% of the test cases have failed
records. A test case could have up to 10.18% failure rate. By comparing lines in
both columns, we also found that some of the failed test cases are having relatively
high number of executions.

After discussion, we agreed that if a test case had relatively high failure rate but its
runs were far less than the average number, this test case would have low priority
or be ignored. Such statistical results were not considered in this study, but they
could be good references for company’s test selection.

Besides creating the heat maps, we used Weka as the data mining tool and built up
models with the classifier RandomForest, then successfully calculated MCC values.
In order to create more reliable analysis results, 10-fold cross validation was applied
to evaluate the models. Concretely, 90% of instances in the data set were used as
training set while rest 10% were treated as test set. Those 10% were randomly chosen
by the system, and after processing them, the program then will automatically al-
ternate another 10% data as new test set until all instances were treated as test sets.

MCC value of three status were computed: MCC of Pass, MCC of Fail/Error and
MCC of None. The analyzed outcomes are summarized and visualized in Figure 4.5.

MCC of PASS MCC of FAIL/ERROR MCC of NONE

Figure 4.5: The MCC results of 10-fold cross validation for small data set
16_90pct.

From the above pie charts, it is very clear that most of calculations returned the
value “NaN”, which means those MCC values cannot be computed. We argue that
the reason is that source data set is relatively imbalance. As shown in previous heat
maps, most of the test cases never failed, as the result, the formula for calculat-
ing MCC would have value “0” as its denominator (formula is described in Section
3.5.1), and thus the calculation cannot be performed. Those kind of MCC results
are not helpful for evaluating the quality of the built models (Table 3.3).

19


4. Data Analysis

During this phase, we found most of MCC values were not calculable due to imbal-
anced data. We argue that such situation could due to two main reasons: the data
in analyzed set is extremely imbalanced and the automatic cross validation of Weka
is not suitable for our case.

For the first reason, since the source data includes historical information from real
industry, the data sets could be unbalanced due to company’s use case. We did not
modify the source data, because filtering the data that we do not expect manually
is not a proper way to do data mining; And for the second reason, it is because
the processed data set includes information of the same test cases that were used
in different time. When Weka randomly selects instances for cross validation, there
are situations that is using future data, the test cases that applied in later jobs, as
the training sets to predict the test cases that applied in earlier jobs, the past data.
This kind of obviously unreasonable cases will seriously affected the forecast results
and later prioritization. To solve this problem, we modified our scripts to manually
select the training sets and the test sets based on calendar order.

In this small data set, job instances are listed in the order of their execution time,
so the earlier data is in the forefront. Thus we selected 66% of data as training set,
and the rest as test set. The new MCC values are presented in Figure 4.6.

MCC of PASS MCC of FAIL/ERROR MCC of NONE

Figure 4.6: The improved MCC results for small data set 16_90pct.

By comparing with the outcomes in Figure 4.5, the improved MCC results from
Figure 4.6 are more satisfying. As can be seen, the proportions of calculable MCC
values are increased, and most of their values close or equal to one. The above
chart shows 85% of MCC values in category of FAIL/ERROR cannot be calculated
(indicates “NaN”), which consistent with the characteristics described in Figure 4.4,
where roughly 85% of tests did indeed never fail. It proves that the new models
are doing very good in predicting (Table 3.3). Due to above relatively satisfactory
results, we are going to apply the same method when processing larger data set,
which is the overall data.

4.3 Processing the Data Sets with Large Size
After receiving good results from processing small data set, we started to test our
method with larger data sets (introduced in Table 3.1). It is worth mentioning that

20


4. Data Analysis

the new data sets have better structure: the previous data sets only contained the
changed files and related executed test cases, but the new also include information
of changed test cases. We also started with visualizing overviews of the entire data
sets by using heat maps.

Number of test cases Failure rate

Job ID

853766

1295627

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Avg ≈ 290.73

Max = 304

Min = 1

Avg ≈ 0.99%

Max ≈ 9.97%

Min = 0%

Figure 4.7: The number of test cases and failure rate based on each job in large
data set 14_90pct.

The Figure 4.7 shows that every job includes at least one test case, while the average
failure rate is below 1%. Similar to the small data set, only a small part of right
column are in color red and most part of solid line stay to the left of the dotted line.

21


4. Data Analysis

The Figure 4.8 shows the correspondence between jobs and test cases in this large
data set.

Number of executions Failure rate

Test case ID

apr_1284863::tc_10292081

ttx_1412353::tc_4256815

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Avg ≈ 2361.67

Max = 5519

Min = 0

Avg ≈ 0.95%

Max = 100%

Min = 0%

Figure 4.8: The number of executions and failure rate based on each test case
based in large data set 14_90pct.

From above heat map, we learned that some test cases were never executed, and
the average failure rate for each test case is around 0.95%. By observing the right
column, we find there are test cases that always failed, as well as some with zero
failure records. There is a blank area on the top of the right column, it is because
some test cases were never executed. Those unexecuted test cases have zero execu-
tion and “NaN” failure rate. Again, according to the distribution of colors and blue
lines, we can see that only a small part of these test cases failed.

Both Figure 4.7 and Figure 4.8 are heat maps of the data set that only contains test
cases with over 90% pass rate. After observing them, we found their situations were
similar to the small data set, in which most of the tests never or seldom failed.

22


4. Data Analysis

For having more complete understanding of the characteristics of the new data sets
and considering the industrial environment, we also generated heat maps for the
data set which contains all test cases regardless of their pass rate, i.e. Large data
set 14_all (introduced in Table 3.1).

Number of test cases Failure rate

Job ID

1282713

1295627

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Avg ≈ 290.28

Max = 368

Min = 0

Avg ≈ 1.71%

Max = 100%

Min = 0%

Figure 4.9: The number of test cases and failure rate based on each job in large
data set 14_all.

As shown in Figure 4.9, the analysis results of new data set is a bit different. First
of all, we noticed the average number of test cases in each job is very close to pre-
vious data set, but in the new one, some jobs contain zero test cases. For some
jobs, all test cases they contain are failed. Since jobs with high failure rate were not
removed, the average value of failure rate is doubled. However, there were still only

23


4. Data Analysis

a few instances that have failed records.

In order to do a comprehensive comparison, we also created the test case based
heat map for this complete data set.

Number of executions Failure rate

Test case ID

apr_1284863::tc_10292081

Avg ≈ 1685.6

Max = 5683

Min = 0

Avg ≈ 1.37%

Max = 100%

Min = 0% ttx_1412353::tc_4256815

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Figure 4.10: The number of executions and failure rate based on each test case
in large data set 14_all.

Comparing the results of Figure 4.8 with the picture above, by including all test
cases, the average number of test case executions is significantly decreased, and the
average failure rate increased. The blank area in the failure column becomes larger
due to more unexecuted test cases were included. Both the Figure 4.9 and Figure
4.10 help to prove that even in the data set with complete information, there is still
only a very small part of the test cases that failed.

With the help of company’s powerful hardware, we successfully processed the large
data set that contains all test cases, built up models and calculated MCC values.

24


4. Data Analysis

Building up models for 300 jobs from the large data set took around one week, and
using the following 100 jobs as test set to evaluate the prediction only took around
two hours. The results were also visualized into pie charts and shown as following.

MCC of PASS MCC of FAIL/ERROR MCC of NONE

Figure 4.11: The MCC of large data set with large data set 14_all.

Comparing the results in Figure 4.11 with previous outcomes (shown in Figure 4.5
and Figure 4.6), the new one looks much better. For the large data set, data is rel-
atively more balanced since it contains all test cases, thus more MCC of test cases
were computable while most of them were good values (Table 3.3). The light blue
part in the chart indicates test cases with “one” as their MCC means the prediction
for them are 100% accurate, which is unlikely to occur in real industrial. The feed-
back shows our predictions of those test cases are indeed very accurate.

Apart from the perfectly predicted tests cases mentioned above, quite a few cases
also have rather high MCC values, between +0.7 and +0.99. Also, we foundd that
there are more “NaN” values in the FAIL/ERROR category than the two other. It
is because the prediction tend to be only PASS or NONE, and few test cases are
predicted to be FAIL/ERROR, which is consistent with the heat map we showed
before (Figure 4.10).

All in all, comparing with the MCC results from cross-validation, the new results are
more satisfactory. With the new method, we successfully built models, implemented
the prediction method and its evaluation.

25


5
Results

As described in Section 4, we processed different data sets and collected correspond-
ing results. Starting with the small data set, after successfully building up models,
we first used them to make predictions for each test case and inferred their failure
rate in different jobs, we then prioritized test cases for each job. As shown in Figure
5.1, it is an example of prioritization result where test cases are sorted based on their
failure rate from Job 7, the test case with the highest probability to fail is listed in
the first row. In following example, the probability to fail of the first test case is 77%.

Figure 5.1: The Prioritization Results Example.

In conjunction with Figure 4.6, these predicted results are more credible. By refer-
encing the APFD (introduced in Section 2.2.3), we adopt a scoring method which
rates points for each instances and then visualized results into scoring curves. In
order to prove the efficiency of our method (APCIT), we compare its score with
scores of three other priority ways:

26


5. Results

1.Optimal order, the ideal list where test cases with failures would be
put in the top.

2.Natural order, the original list of test cases that is sorted alphabet-
ically.

3.Random order, the list contains test case that sorted by the system
randomly.

4.APCIT, the list sorted by our method.

The score for each job is calculated by computing the sum of the ordinals of the
failing test cases in the sorted list. In the optimal list, test cases with failures would
be put in the top of it. E.g. When a job contains ten test cases and three of them
fail. These three shall be listed as the first three and have "1", "2" and "3" as their
ordinal respectively. For calculating its optimal score, we have "1 + 2 + 3", and
the result point is "6"; In APCIT list, if failures were predicted and ranked as "1st",
"2nd" and "4th", then the score would be "1 + 2 + 4 = 7". The scores of natural
order and random order are calculated in the same way. The closer the score of a
sorting method and the optimal result of jobs, the more accurate and valuable the
predicted prioritization.

The reason we introduce the natural order and random order is that their results
can reflect current situation of functional testing in the case company. At the same
time, comparing with the optimal results in scoring curve can also highlight the ad-
vantages of APCIT. The following picture shows the evaluation results of prediction
for small data set.

27


5. Results

Figure 5.2 and 5.3 show the curves for small data set 16_90pct.

Figure 5.2: The scoring curves for small data set 16_90pct.

Figure 5.3: The magnified scoring curves for small data set 16_90pct.

The Figure 5.2 shows scoring curves for small data set 16_90pct. In the graph, the
x-axis represents jobs and the y-axis is total score. From its upper part, it is clear
that natural order curve and random order curves are very close and do not have
much differences. It is because in the source data, test cases are sorted alphabeti-
cally, thus they are already listed like random order. Meanwhile, the APCIT scores
are much lower than the scores of both natural and random order, and also near to
optimal results on expense of very high computational effort.

In order to distinguish the curve better, we also zoomed in APCIT and optimal
curves as the Figure 5.3. As shown in the second figure, although these two curves
do not completely overlap, they are very close to each other and their trends of
changing are also the same. Their heap scores are much less than natural or random
order. Moreover, not overlapping also means the results are not overfitting. There-
fore, we believe that the prediction is very accurate, and we can apply this method

28


5. Results

to large data set 14_all.

The following figure shows scoring curves for the large data set, it has the same
expressions and similar characteristics as the previous figures.

Figure 5.4: The scoring curves for large data set 14_all.

Figure 5.5: The magnified scoring curves for large data set 14_all.

29


5. Results

The figures illustrate that the curve of APCIT is far superior to the curves of natu-
ral and random order, while also very close to the optimal one. However, since the
amount of data is very large, the scores are quite high which lead to many overlaps,
even in the enlarged part. But it also shows that the results of APCIT and optimal
are very close, which proves the accuracy and reliability of our predictions.

30


6
Discussion

6.1 Bottlenecks and Challenges

We think the biggest bottleneck in this study was during the first half phase. At
that time, we were not aware of the need to use powerful hardware from the case
company, and thus, we were unable to carry out analysis of large-scale data nor do
proper testing. After we realized the this need and having had a discussion with
the case company, we were able to mitigate this problem and remove the bottleneck
with their help.

The biggest challenge in this study was to find an efficient way to handle and an-
alyze the data sets. Each execution of scripts would cost a lot of time. Although,
building up models using the small data set 16_90pct only took around two hours
and twenty minutes for evaluation and prediction, finishing the tasks for the first
300 jobs in large data set 14_all took around one week and three hours respectively.
We assume generating models for the entire large data set 14_all would need around
two months. Multiple debugging and testing iterations were costly and difficult to
apply during the development process, so we had to be careful with making every
change to the scripts during the design process and make reasonable arrangements
for the schedule.

Besides, the built models are non-updateable. Even though the frequency of how
frequent the model needs to be rebuilt and how large the training set shall be are
good questions, and need to be answered before the method is applied in industry,
we were not able to verify them within this thesis.

Further, since the analyzed data sets contain relatively large amount of informa-
tion, the analysis actions require strong hardware support. During the process of
parsing 300 instances, the physical memory used for building models were around
7GB, where the final models occupied around 13GB of disk space.

31


6. Discussion

6.2 Trade-offs

6.2.1 Selection of Algorithms
As explained in Section 2.2.2, we decided to use RandomForest as our main focus
after studying the relevant literature. But we did make another attempt since we
found that Weka supports an algorithm named NaiveBayesUpdateable, which can
not only support quick updates to the model but also computes the results rapidly.
The figure below shows a part of the MCC results from using the NaiveBayes clas-
sifier.

Figure 6.1: The MCC results of NaiveBayesUpdateable.

In Figure 6.1, it shows that many MCC values for test cases can be calculated,
but they are between -0.19 and +0.19, which means these results are like random
guessing and not credible. By comparing with the outcomes of the results from
RandomeForest, we decided to abandon using NaiveBayes.

6.2.2 Filtering Data Sets
In the early stages of the study, we analyzed the small data set. By counting, there
were 299 test cases and only 55 of them had failed. As discussed in Section 6.1, the
analysis program requires a lot of time. For the purpose of time-saving, we removed
all none-failure test cases before using Weka for data mining. However, in later
experimenting process, we found that those removed test cases did not consume too

32


6. Discussion

much time to process. For the large data sets, the analysis time of such test cases
is only tens of seconds. Thus in order to maintain integrity of data sets, we skipped
filtering of test cases in the rest of the study.

6.2.3 Scoring System
As mentioned before, our scoring system referred to the calculation of Average Per-
centage of Faults Detected (APFD), and we believe it is scientific. We used our own
methodology instead of APFD since we found the latter is improper for the use case
in this study. According to Praveen [29], the formula of APFD is:

Where the value “m” is the number of faults contained in the program under test P,
the value “n” is the total number of test cases, and the value “TFi” is the position
of the first test in T that exposes fault “i”.

This APFD formulation can only be implement when at least one failed test case
was included in each job. In our use case, most of the test cases never fail (reference
heat maps in Section 4), thus the m value in APFD would be zero and the entire
calculation cannot be carried out. The same problem does not exist in our scoring
system because we are only grading and counting failed test cases. Due to this, we
did not completely apply APFD in this study.

6.3 Feasibility, Applicability and Practicality
Since doing data mining on the source data sets resulted acceptable MCC values,
we believe that the method of predicting the failure probability of test cases and
to prioritize them by mining the historical data of changed files and associated test
case results is feasible.

Meanwhile, by doing data selection, preprocessing, transformation, data mining and
evaluation, we produced reasonable and valuable results that also recognized by the
case company. Thus we argue that the result of this study is applicable and valuable.

In addition, the scoring curves show in Section 5 also confirm that the method
described in this report can provide very accurate predictions, which has a high
practicality.

To answer the RQ1. Statements above prove that the case company is capable
to provide us historical data for automatic classification and prediction.

33


6. Discussion

To answer the RQ3, we presented and discussed our results with the case company.
The company recognized our outcomes and agrees on using this method could be
very helpful. After fully implemented, APCIT can increase productivity of func-
tional testing by decreasing the feedback time from failing test cases.

6.4 Restrictions
As mentioned in Section 6.1, the time for this study is very limited. Therefore, it
is difficult for us to make a deep analysis or verification of different data sets. In
this situation, we cannot build up a complete program for data mining and analysis.
But instead, we do provide a working prototype and satisfactory conclusions based
on the given use cases.

Since this method has relatively high demands on hardware, especially the phys-
ical memory, a personal computer would have difficulties with carrying out the
processing of large data sets. At the same time, due to this method is very time
consuming, updates of the models could also be very expensive, thus built models
are not likely to be updated frequently, which could limit its scope of application.
We have selected what we believe is the most efficient way from existing algorithms,
if in the future, a more efficient algorithm is developed, these restrictions might be
solved. All above are the main limitations that we found during our study.

The answer for RQ1 in Section 6.3 already prove that APCIT is suitable for the
use case. The discussion in this section supplementary answered the RQ2, which
proposes some notes when this method is applied to other use cases.

6.5 Threats to Validity
We performed semi-structured interviews and non-systematic literature review in
this study, because they can help us to correctly understand the use case and gain
basic domain knowledge. They are not our main focuses, instead, we highlighted
our processes of building up models and evaluation.

This study is based on the use case of the case company. Data distribution in
analyzed source data sets are very uneven, most of the test cases tend to pass. Thus
we only selected RandomForest for our data analysis. In other cases, if a more even-
distribution historical data was used for processing, other algorithms might also be
able to build excellent models. Therefore, we cannot guarantee that our method
will applicable for other use cases.

Our method is designed for and tested with the use case of the case company.
The risks of applying APCIT to other use cases are mentioned in Section 6.4.

34


6. Discussion

This study was conducted by two master students of software engineering domain.
During the research processes, they monitored and promoted each other to guaran-
tee the accuracy and reliability of the results. Meanwhile, the final outcomes were
verified by the case company which further enhance their value.

35


7
Conclusion

This study was designed to help the case company to prioritize their test cases. The
company will not deliver products to their customers until all tests are executed
and passed. However, during development, it might be good to not always run all
test cases. Also, if all tests shall be run, it could be very helpful for giving feedback
as fast as possible. And by prioritizing test cases that are likely to fail, the feed-
back loops can be reduced. We tried to make prioritization by focusing on finding
the relevance between results of test case executions and changed files by using the
technology of data mining. As the result, after successfully predicted probabilities
to fail for all test cases in different jobs, we are able to provide prioritization lists
for all test cases in different jobs.

Our APCIT method can provide accurate predictions. After inspection, the ac-
curacy of prioritization is far superior to natural and random order. In order to use
this method, a user needs to extract historical data in a certain format, and then
executes the corresponding scripts to complete all the remaining steps, including
transforming the format data, building up the models, obtaining and verifying the
results. The built models can then be utilized to generate prioritization list for new
data that include changed files and related test case results until the model out of
date. But it is worth mentioning that the process of building models takes a lot of
time and the models are not easily updatable, users need to weigh whether and how
to apply it according to their specific situations.

Overall, after processing as much different data sets as possible in limited time
and verifying their results scientifically and rationally, we believe that APCIT can
provide excellent test case prioritization for the use case of the case company, and
it is a highly desirable method. Also, this method has the potential to be further
implemented and applied to other use cases.

7.1 Future Work
In this thesis, we focus on probing the feasibility, applicability and practicality of
APCIT by analyzing the characteristics of source data sets from the case company
and providing prioritization lists based on prediction of failure probabilities for test
cases in different jobs. As discussed in Section 6.1, parsing data sets is very time-
consuming and we can only study on a few cases. If we were to continue with this

36


7. Conclusion

study in the future, we would be able to process more data sets and evaluate the
models with various of test sets, which further help us to examine how often the
models need to be updated (regenerated).

It would also be interesting to have further discussions with the case company on
how good (accurate) prediction is required. Our method can provide precise fore-
casting, but if the company was less critical with the accuracy of predictions, then,
finding other solutions that can parse data faster could be another valuable study.
Meanwhile, once models have been generated, prioritizing new inputs would only
costs relatively short time, which makes daily execution feasible. Thus investigating
how often the evaluation shall be applied with the company would also be another
valuable future study.

37


Bibliography

[1] Martin Fowler (2006). Continuous Integration. Available from:
<http://www.dccia.ua.es/dccia/inf/asignaturas/MADS/2013-
14/lecturas/10_Fowler_Continuous_ Integration.pdf>. [8 January 2016].

[2] Sean Stolberg (2009). Enabling Agile Testing through Continuous Integration.
IEEE Conference Publications.

[3] Agneta Nilsson, Jan Bosch, and Christian Berger (2014). Visualizing Testing
Activities to Support Continuous Integration: A Multiple Case Study. Springer
International Publishing Switzerland.

[4] Adam Debbiche, Mikael Diener, and Richard Berntsson Svensson (2014). Chal-
lenges When Adopting Continuous Integration: A Case Study. Springer Inter-
national Publishing Switzerland.

[5] Rudolf Ramler, Michael Felderer (2016). A Process for Risk-Based Test Strategy
Development and Its Industrial Evaluation. Product-Focused Software Process
Improvement.

[6] Robert Feldt, Greger Wikstrand, Jeevan Kumar Gorantla, Wang Zhe (2009).
Dynamic Regression Test Selection Based on a File Cache - An Industrial Evalu-
ation. International Conference on Software Testing Verification and Validation.

[7] Linlin Wang (2015). Implementation and Evaluation of an Automatic Recom-
mender for Integration. Chalmers University of Technology.

[8] Osmar R. Zaiane (1999). Introduction to Data Mining. CMPUT690 Principles
of Knowledge Discovery in Databases.

[9] Jiawei Han, Micheline Kamber (2001). Data Mining: Concepts and Techniques.
London:Academic Press, 5.

[10] Ian H. Witten, Eibe Frank (2005). Data Mining - Practical Machine Learning
Tools and Techniques - 2nd Ed. Morgan Kaufmann Publishers is an imprint of
Elsevier.

[11] Felderer M, Schieferdecker I (2014). A taxonomy of risk-based testing. Interna-
tional Journal on Software Tools for Technology Transfer.

[12] Tin Kam Ho(1995). Random Decision Forest. Proceedings of the Third Inter-
national Conference on Document Analysis and Recognition, vol.1, p.278-282.

[13] Tin Kam Ho (1998). The Random Subspace Method for Constructing Decision
Forests. IEEE Transaction Pattern Analysis and Machine Intelligence 20 (8),
p.832-844.

[14] Huotao Deng, George Runger, Eugene Tuv (2011). Bias of importance measures
for multi-valued attributes and solutions. Proceedings of the 21st International
Conference on Artificial Neural Networks.

38


Bibliography

[15] David Martin (2011). ’Evaluation: From Precision, Recall and f-measure to
ROC, Informedness, Markedness & Correlation. Journal of Machine Learning
Technologies. 2 (Issue 1), p.37-63.

[16] Deborah J. Rumsey (2011). Statistics For Dummies, 2nd Edition. Wiley Pub-
lishing, ch 10.

[17] Ahmed Hassan (2008). The road ahead for mining software repositories. FoSM:
Frontiers of Software Maintenance, p.48-57.

[18] Ahmed Hassan (2006). Mining Software Repositories to Assist Developers and
Support Managers. ICSM 2006: 22nd IEEE International Conference on Soft-
ware Maintenance.

[19] Mathias Meyer (2014). Continuous Integration and Its Tools. IEEE SOFT-
WARE, vol. 31, issue 3.

[20] Ripon K. Saha, Lingming Zhang, Sarfraz Khurshid, Dewayne E. Perry (2015).
An Information Retrieval Approach for Regression Test Prioritization Based on
Program Changes. ICSE 2015.

[21] Cem Kaner, Jack Falk, Hung Quoc Nguyen (1999). Testing Computer Software
2nd Ed. Wiley Computer Publishing, p.480.

[22] Shalal-Esa, Andrea, Maureen Bavdek (2013). Investigators to travel to Pratt
facility in Phoenix: NTSB. Reuters.

[23] Kaner, Falk, Nguyen (1999) Testing Computer Software. Wiley Computer Pub-
lishing, p.42.

[24] Paul Ammann, Jeff Offutt (2008). Introduction to Software Testing. Cambridge,
p.215.

[25] Tin Kam Ho (2002). A Data Complexity Analysis of Comparative Advantages
of Decision Forest Constructors. Pattern Analysis and Applications, p.102–112.

[26] Tao Shi, Steve Horvath (2006). Unsupervised Learning with Random For-
est Predictors. Journal of Computational and Graphical Statistics, vol.15(1),
p.118–138.

[27] Andy Liaw, Matthew Wiener (2002). Classification and Regression by random-
Forest. R News, Vol.3(2), p.18-22.

[28] Gregg Rothermel, Roland H. Untch, Chengyun Chu, Mary Jean Harrold (2001).
Prioritizing Test Cases For Regression Testing. IEEE Transactions on Software
Engineering, Vol.27(10).

[29] Praveen Ranjan Srivastava (2005). Test Case Prioritization. Journal of Theo-
retical and Applied Information Technology.

[30] Yafeng Lu, Yiling Lou, Shiyang Cheng, Lingming Zhang, Dan Hao, Yangfan
Zhou, Lu Zhang (2016). How Does Regression Test Prioritization Perform in
Real-World Software Evolution?. ICSE 2016.

[31] Eibe Frank, Mark Hall, Geoffrey Holmes, Richard Kirkby, Bernhard Pfahringer,
Ian H. Witten, Len Trigg (2005). WEKA - A Machine Learning Workbench for
Data Mining. Data Mining and Knowledge Discovery Handbook, p.1305-1314.

39


	Introduction
	Background and Related Work
	Background
	Case Company
	Continuous Integration and Testing
	Data Mining

	Related Work
	Mining Software Repositories (MSR)
	Test Case Prioritization
	Average Percentage of Faults Detected (APFD)


	Research Methods
	Research Purpose
	Research Questions
	Research Methodology
	Interviews
	Literature Reviews

	Data Obtaining and Processing
	Data Mining Tool
	Data Structure
	Data Transformation

	Measurements
	Confusion Matrix
	Measurement Definitions


	Data Analysis
	Familiarizing with Initial Data Sets
	Processing the Data Set with Small Size
	Processing the Data Sets with Large Size

	Results
	Discussion
	Bottlenecks and Challenges
	Trade-offs
	Selection of Algorithms
	Filtering Data Sets
	Scoring System

	Feasibility, Applicability and Practicality
	Restrictions
	Threats to Validity

	Conclusion
	Future Work

	Reference