Improving Continuous Integration Feedback Flow
A Design Science Study

Master's thesis in Computer science and engineering

Christian Lind

Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
Gothenburg, Sweden 2024

© Christian Lind, 2024.

Supervisor: Miroslaw Staron, Department of Computer Science and Engineering
Advisor: Patrik Firek, Zenseact
Examiner: Eric Knauss, Department of Computer Science and Engineering

Master's Thesis 2024
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Typeset in LaTeX
Gothenburg, Sweden 2024

Abstract

Continuous integration represents a prevalent practice involving the automated merging of code modifications from various contributors into a unified software project. Despite its widespread adoption, this process often entails considerable time and is susceptible to failures. Consequently, efforts have been directed towards anticipating the outcome of the continuous integration process prior to its initiation. This thesis explores the feasibility of predicting the outcome in near-real-time, leveraging the data accessible within the continuous integration job at that specific moment, employing a design science research approach across three iterative cycles. Utilizing the design science research approach, the thesis initially delved into the issue by gathering data through interviews and a concise literature review. This process resulted in identifying the problem of delivering improved and swifter feedback to developers. The literature review also unearthed prior efforts aimed at addressing the same issue, prompting an exploration into employing machine learning to forecast build outcomes based on continuous integration (CI) job log data. The outcomes of evaluating various algorithms spurred both empirical and qualitative/quantitative analyses, augmented by interviews with developers at Zenseact. The primary contribution lies in the crafted artifact itself, a significant addition to the realm of predicting the outcome of continuous integration job builds, serving as a practical solution validated within an industrial setting. This artifact not only introduces innovative resolutions to recognized challenges but also enriches the repository of design science knowledge.

Keywords: continuous integration, machine learning, just-in-time prediction, design science research

Acknowledgements

I would first and foremost like to thank my academic supervisor, Miroslaw Staron, and my industrial supervisor, Patrik Firek, for helping me during the project with any challenges that occurred. I would also like to thank David Friberg for guidance on legal issues and for making the necessary preparations for the thesis. Lastly, I would like to thank the rest of the Overflow team and all the developers at Zenseact who supported me in any way while working on the thesis.
Christian Lind, Gothenburg, June 2024

Contents

List of Figures
List of Tables
1 Introduction
  1.1 Problem Description
  1.2 Purpose of the study
  1.3 Limitations and Delimitations
  1.4 Significance of the study
  1.5 Thesis outline
2 Background
  2.1 Machine learning
    2.1.1 Data preprocessing
    2.1.2 Tokenization
    2.1.3 Training
    2.1.4 Overfitting and Underfitting
    2.1.5 Supervised learning
    2.1.6 Evaluation metrics
  2.2 Traditional machine learning
    2.2.1 Random forest classifier
    2.2.2 Gradient boosting
  2.3 Deep learning
    2.3.1 Layers
    2.3.2 Activation functions
    2.3.3 Convolution Neural Network
    2.3.4 Long short-term memory
3 Related work
  3.1 Traditional machine learning in CI
  3.2 Deep learning in CI
  3.3 Just-in time defect prediction
  3.4 Bringing feedback to developers
4 Research Design
  4.1 Solution
    4.1.1 Machine learning model
    4.1.2 Interface for developers
  4.2 Evaluation
    4.2.1 Computational experiments
    4.2.2 Interview
5 Artifact
  5.1 Feedback flow
  5.2 Prediction system
6 Findings
  6.1 First iteration
    6.1.1 Result
    6.1.2 Analysis
  6.2 Second iteration
    6.2.1 Result
    6.2.2 Analysis
  6.3 Third iteration
    6.3.1 Result
    6.3.2 Analysis
  6.4 Iteration 4
    6.4.1 Results
    6.4.2 Analysis
  6.5 Iteration 5
    6.5.1 Results
    6.5.2 Analysis
7 Discussion
  7.1 Threats to validity
  7.2 Future work
8 Conclusion
Bibliography
A Interview Template
B Appendix 2

List of Figures

2.1 Example of how the gradient descent algorithm tries to reach zero.
2.2 CNN network.
2.3 Basic RNN structure.
4.1 Flowchart of what is explained in the bullet list.
5.1 Flowchart of feedback loop.
5.2 Flowchart of how the predictor functions.
6.1 Confusion matrix comparison of RF and GB on Each Row model.
6.2 Confusion matrix comparison of RF and GB on the Whole Log model.
6.3 Confusion matrix comparison of RF and GB on the Logtime model.
6.4 Time and line progression for each test sample in job 1.
6.5 Comparison of predicting on each line compared to on only every tenth line. The RF classifier is used together with the char tokenizer.
6.6 Comparison of different tokenizers on Whole Log model with MCC scores.
6.7 Logtime model's MCC score using several types of tokenizers.
6.8 Difference between accuracy when predicting unsuccessful and successful jobs running on the Logtime model.
6.9 Number of incorrect predictions using the RF classifier and char tokenizer, where accuracy is represented by the blue line and the number of incorrect predictions is depicted by the orange line over the number of lines used in prediction.
6.10 Number of jobs that are left after a certain number of lines has been reached, where accuracy is represented by the blue line and the number of jobs remaining is depicted by the orange line over the number of lines used in prediction.
6.11 Impact of keeping and removing the timestamp from the logs running on the Logtime model.
6.12 Time and line progression for each test sample in job 2.
6.13 Number of jobs that are left after a certain number of lines has been reached in job 2. Accuracy is represented by the blue line and the number of jobs remaining is depicted by the orange line over the number of lines used in prediction.
6.14 Time and line progression for each test sample in job 3.
6.15 Number of jobs that are left after a certain number of lines has been reached in job 3. Accuracy is represented by the blue line and the number of jobs remaining is depicted by the orange line over the number of lines used in prediction. The combination used for getting the accuracy metric is RF with the char tokenizer and equal classes.
6.16 How balancing data on equal classes, 70/30 split in favor of good samples and all samples used affects the MCC score of the different classifiers and tokenizers when running on the Logtime model using MCC as metric.
6.17 How balancing data on equal classes, 70%/30% split and all samples used affects the accuracy of the different classifiers and tokenizers when running on the Logtime model using accuracy as metric.
6.18 How balancing data on equal classes and 70%/30% split affects the different classifiers and tokenizers when running on the Logtime model.
6.19 GridSearchCV best hyperparameters versus default hyperparameters for all tokenizers and classifiers on the Logtime model on job 1 using MCC as metric.
6.20 GridSearchCV best and default hyperparameters for all tokenizers and classifiers on the Logtime model on job 1 using accuracy as metric.
6.21 GridSearchCV best hyperparameters versus default hyperparameters for all tokenizers and classifiers on the Lines model using MCC as metric.
6.22 GridSearchCV best versus default hyperparameters for all tokenizers and classifiers on the Lines model on job 1 using accuracy as metric.
6.23 Best hyperparameters for each combination of classifier and tokenizer on job 1, showing results of predictions on the Lines model for all datasets, successful jobs and unsuccessful jobs using accuracy as metric.
6.24 Best parameters for each combination of classifier and tokenizer, showing results of predictions on equal classes for the Logtime model running on job 2 using MCC as the metric.
6.25 Best parameters for each combination of classifier and tokenizer, showing results of predictions on equal classes for the Logtime model running on job 2 using accuracy as the metric.
6.26 Best parameters for each combination of classifier and tokenizer, showing results of predictions on equal classes for the Logtime model running on job 3 using MCC as metric.
6.27 Best parameters for each combination of classifier and tokenizer, showing results of predictions on equal classes for the Logtime model running on job 3 using accuracy as metric.
6.28 Best hyperparameters for each combination of classifier and tokenizer, showing results of predictions on equal classes for the Lines model running on job 2 using the MCC metric.
6.29 Best parameters for each combination of classifier and tokenizer, showing results of predictions on equal classes for the Lines model running on job 2 using the accuracy metric.
6.30 Best parameters for each combination of classifier and tokenizer, showing results of predictions on all datasets for the Lines model running on job 3 with equal classes.
6.31 Best layer and epoch count for each combination of classifier and tokenizer, showing results of predictions for the Logtime and Lines models running on job 1 with equal classes using MCC as measurement.
6.32 Best layer and epoch count for each combination of classifier and tokenizer, showing results of predictions for the Logtime and Lines models running on job 1 with equal classes using accuracy as measurement.
6.33 Best layer and epoch count for each combination of classifier and tokenizer, showing results of predictions for the Logtime and Lines models running on job 2 with equal classes using MCC as measurement.
6.34 Best layer and epoch count for each combination of classifier and tokenizer, showing results of predictions for the Lines model running on job 2 with equal classes.
6.35 Best layer and epoch count for each combination of classifier and tokenizer, showing results of predictions for the Logtime and Lines models running on job 3 with equal classes using MCC as measurement.
6.36 Best layer and epoch count for each combination of classifier and tokenizer, showing results of predictions for the Logtime and Lines models running on job 3 with equal classes.
A.1 Link to CI system inside Gerrit commit comment leading to artifact.
A.2 Showing how likely the prediction system thinks the build is going to fail after a given number of log lines has been printed to the log.
A.3 Showing how likely the prediction system thinks the build is going to fail after a given number of log lines has been printed to the log.
A.4 Link to CI system inside Gerrit commit comment leading to marked text.
A.5 Example of log text that is marked in red as the prediction system thinks this will cause the build to fail.
A.6 Example message sent to the user from a bot with a link leading to an artifact in the CI system.
A.7 Prediction button when no prediction is available.
A.8 Prediction button when a prediction is available.
A.9 Message shown to the user when clicking the predictions button.
A.10 Icon to notify the developer that a prediction for the edited code is available.
B.1 Comparison of different tokenizers on Whole Log model with accuracy.
B.2 Logtime model's accuracy using several types of tokenizers.
B.3 Difference between predicting unsuccessful and successful jobs running on the Whole Log model with different tokenizers.
B.4 Impact on MCC score when keeping and removing the timestamp from the logs running on the Whole Log model.
B.5 Impact on accuracy when keeping and removing the timestamp from the logs running on the Logtime model.
B.6 Impact on accuracy when keeping and removing the timestamp from the logs running on the Whole Log model.

List of Tables

4.1 Performance impact on training with different classifiers, models and tokenizers.
6.1 Combinations evaluated in the first iteration.
6.2 Accuracy metrics when running model Each Row with the LabelEncoder tokenizer. RF metrics to the left and GB to the right.
6.3 Accuracy metrics when running Whole Log model with the LabelEncoder tokenizer. RF metrics to the left and GB to the right.
6.4 Accuracy metrics when running Logtime model with the LabelEncoder tokenizer. RF metrics to the left and GB to the right.
6.5 Combinations evaluated in the second iteration.
6.6 Performance impact on training with different classifiers, models and tokenizers.
6.7 Combinations evaluated in the third iteration.
6.8 Number of good and bad samples in the different jobs.
6.9 DTW distance between default and best hyperparameters as well as the performance impact on training with different classifiers, tokenizers and hyperparameters on the Logtime model.
6.10 DTW distance between best hyperparameters and default as well as the performance impact on training with different classifiers, tokenizers and hyperparameters with the Lines model.
6.11 DTW distance between best and default hyperparameters as well as the performance impact on training with different classifiers, tokenizers and best hyperparameters with the Logtime model on job 2.
6.12 Performance impact on training with different classifiers, tokenizers and hyperparameters with the Logtime model on job 3.
6.13 DTW distance between the best hyperparameters and the default hyperparameters as well as the performance impact on training with different classifiers, tokenizers and hyperparameters with the Lines model on job 2.
6.14 DTW distance between the best hyperparameters and the default hyperparameters as well as the performance impact on training with different classifiers, tokenizers and hyperparameters with the Lines model on job 3.
6.15 Combinations evaluated in the fourth iteration.
6.16 Performance impact on training with different DL classifiers, models, tokenizers and best hyperparameters on job 1.
6.17 Performance impact on training with different DL classifiers, models, tokenizers and best hyperparameters on job 2.
6.18 Performance impact on training with different DL classifiers, models, tokenizers and best hyperparameters on job 3.
A.1 Interview questions.

1 Introduction

As software repositories become bigger, with more and more tests added, the time taken for a continuous integration (CI) job to complete continues to increase. After committing changes to a version control system, developers are often eager to promptly receive feedback from the CI job regarding the success or failure of their commit [1]. When a CI job takes over 30 minutes [2] to complete, the developer loses focus and the workflow is disrupted. Developers also desire concise and explanatory data to promptly identify the specific failure that led to the erroneous return of the CI job [3].

To address the aforementioned challenges, predicting the build outcome in CI has emerged as a prominent area of interest. Previous studies have primarily focused on predicting job outcomes based on commit metadata. However, scant attention has been paid to predicting the job outcome while the CI job is actively running.
During this phase, additional parameters become available which were not considered in earlier research efforts. Consequently, leveraging these parameters during job execution should enhance the accuracy of outcome predictions.

The primary objective of this study is to identify and implement enhancements to the developer workflow, particularly focusing on expediting feedback from the CI system. This study will be conducted in collaboration with the observability team at Zenseact, a company dedicated to advancing fully autonomous vehicles. Collaborating with Zenseact offers the advantage of leveraging their established CI infrastructure, along with access to tens of thousands of saved logs from previous CI job runs.

1.1 Problem Description

A developer will often want a fast response as to why their code has failed to be integrated with the CI system, because they first want to see why the commit in question failed the tests. CI systems today therefore employ techniques that are supposed to make builds faster, such as caching, multithreading and optimizing code as much as possible [4]. What all of these methods have in common is that they can break an already working system. Even with all these techniques, in a large-scale product the CI jobs can still take multiple hours to finish [4], and eagerness for quick feedback may turn into frustration over a slow completion. The pipeline job may also have problems which are not easy to spot for a developer not that invested in how the CI system works [5]. This means that a pipeline job that takes longer than usual might raise warning signs for newcomers while, in reality, it may simply be that a new update needs to be installed.

One problem for developers is that a single pipeline job can take multiple hours to complete, with over 40% of all jobs reportedly taking more than 30 minutes [2]. This can lead to long wait times before commits are finally merged into the repository, as each new commit has to go through a new integration process until all stages in the CI job pass. If there are multiple bugs in the code that the developer has not fixed, this might require multiple commits before all fixes are applied, depending on when the developer finds the bugs and how long it takes to correct them.

As a result of long build times, developers tend to lose focus, and this can hurt productivity and parallel development [6]. Long build times can also lead to more computational resources being needed to complete the build process before a failure is discovered. Waiting for a CI build to finish, or starting another project on the side while waiting for a CI build to finish, can negatively affect the productivity of a developer [7].

To remedy this problem, machine learning (ML) algorithms can be used to predict the outcome of the build [8]. This can help developers predict whether a build will fail or pass and thus know the outcome of the build beforehand. However, this ML approach cannot help in finding where a failure might occur in the code they have uploaded. This can lead to the developer having to spend a lot of time looking through the code to find the failure. The failure might not be clear before running the pipeline job and getting the result, resulting in significant time wasted by the developer.

Another problem for most developers is how verbose a log should be.
When running all checks in a CI job, it is typical to also log every step the CI job takes in order to ensure everything is working accordingly. These logs can for example contain the CMake log from building the program with CMake. While these logs can help the developer find the cause of the build issues, they may not be as clear as one would expect [9]. A more verbose build log may contain unnecessary details, making it hard to read [10]. On the other hand, a minimal log may not contain all the necessary information for solving the problem [11].

In large CI systems with a lot of building and testing, it is common to get back a log with megabytes of data [12]. When a pipeline job fails it will indicate which stage the failure occurred in, but if that stage has a lot of log data, it can still be very time-consuming for the developer to find the failure [13]. It is also frustrating for the developer to get back a large log showing a failure that is hidden behind thousands of lines of log entries. For failed jobs, it can be challenging to quickly find the reason why the underlying job failed [13]. This results in an increased cognitive load on developers. It can also lead to questions about what caused the pipeline to fail [14]. Developers often look for keywords in the logs which might indicate the failure [14], but these may be misleading and thus trick the developer into fixing something that is not broken in the first place or is completely unrelated to the real issue.

A CI DevOps engineer would want to find failures in the CI job itself, while the developer only cares about the issue that is specific to their commit. This can create a scenario where the developer must search through large logs to find the right issues for their build. Zampetti et al. [8] mention that their survey reports disagreement on how large a log should be, later remarking that it is almost impossible to find a scenario where a more minimalistic log is better. Each passed or failed CI job is accompanied by a log in textual form. For a pipeline job that compiles C++ libraries, the log is a C++ build log, and a failed job should contain information about why the build failed. For a normal developer working with the C++ code, it might be hard to fully understand how to search the log for the related issue.

Today, developers often get to choose where it is relevant to log inside the code [10]. This can be problematic, as developers have different domain knowledge, which can lead to not logging properly in places where an error can occur. This can in turn lead to logs that contain no information regarding the real reason why a build failed and thus will be hard to debug [10]. Such a problem can be seen in Listing 1, where Docker has tried to push a new Docker build but failed because of some Python script. Usually an exception is shown as to why the Python script has failed, but this is not the case in this instance. If a log does not contain the information needed, then it will be impossible for a developer as well as a prediction algorithm to correctly find what caused the build to fail. In this case, it is irrelevant how much log data there is and how fast a developer wants to get the information, as the relevant information does not exist.
&1|20:52:35.222 Running: docker push hub.docker.com/car_platform/master/71ce2b95d1d76856eceb84/config:src-2.11
&1|20:52:35.265 !! Error when executing ['/var/lib/agent/pipelines/src/build/a549-4392-19b57c7915cc/venv/bin/python', 'config/jobs/deliver_car_platform.py'], return_code 1
?1|20:52:35.395 [go] Task status: failed (147597 ms) (exit code: 1)

Listing 1: Example of log output with no information on what caused the error.

1.2 Purpose of the study

The purpose of the study is to explore to what extent it is possible to provide better feedback in near-real-time to developers from a CI system. The study will result in a proof of concept that simulates a CI system sending streams of new log lines in near-real-time to a model for analysis. This model then provides near-real-time feedback on whether the build is likely to fail, giving developers the chance to choose between multiple different ways to implement a feedback flow. Thereafter, the feedback flow is evaluated based on how developers perceive its usefulness compared to a normal CI workflow. Additionally, quantitative analysis is conducted to assess the accuracy and the MCC score of the algorithm in predicting build outcomes in near-real-time. These findings are then juxtaposed with developers' expectations regarding prediction accuracy. Furthermore, considerations such as time savings and computational resources are factored in to determine the viability of such a solution. Ultimately, this thesis aims to benefit both developers and researchers by providing insights into optimizing the CI workflow.

1.3 Limitations and Delimitations

The thesis will focus on providing feedback to the developer in near-real-time while the CI job is running. It will therefore not feature any study on providing feedback before or after the job has been run. The feedback will only be provided using already existing ML classifiers, meaning no research on new classifiers or similar. In this thesis, multiple combinations of models and classifiers will be evaluated, but it is impossible to evaluate all of them within this thesis's scope. Classifiers have been chosen based on their performance in other work in a similar area, predicting CI build outcomes or classifying logs. The models have been created and chosen based on how log files are typically preprocessed in a similar scenario.

Another limitation is that only CI logs from one company are used. Multiple jobs are used, but all are provided by a company specializing in autonomous driving for cars. There are more aspects of a car than what Zenseact is building, and therefore the way other companies build their pipelines might look different. This is also true for any other industry where the needs of the company are different.

1.4 Significance of the study

This study contributes both to the academic field and to practitioners. In academia, it explores a novel area by predicting CI job outcomes in near-real-time using log data generated by a CI system, an area previously only examined before a job has been started. It identifies various methods for making such predictions and evaluates their effectiveness, ultimately presenting an artifact that underscores the value of developing such a predictive system.

The contribution to Zenseact is a new prediction system set up and ready for use. Numerous CI jobs from Zenseact have undergone testing, and additional ones can be seamlessly integrated.
While the contribution to Zenseact surpasses that to other companies due to the accessibility of these tools, the knowledge provided in this report facilitates the replication of similar systems by other organizations. Furthermore, companies can fine-tune various parameters to optimize compatibility with their own CI systems.

1.5 Thesis outline

In Chapter 1, a concise overview of the thesis subject was presented, including an exploration of the project's objectives and limitations. Chapter 2 introduces relevant terminology and concepts for this thesis. Chapter 3 presents other studies relevant to this thesis and how their findings will be used. Chapter 4 describes the research methodology used in this study. Chapter 5 describes the developed artifact and its functionality. Chapter 6 presents the results and analysis of all iterations. Chapter 7 argues for the findings and answers the research questions. Chapter 8 draws conclusions for the thesis.

2 Background

This chapter introduces important concepts that are integral to the workflow. It begins by exploring the fundamentals of machine learning (ML) and the evaluation metrics used throughout the thesis; it then discusses traditional machine learning and, later on, deep learning (DL) techniques, elucidating their respective functionalities.

2.1 Machine learning

Machine learning (ML) is a branch of artificial intelligence (AI) that empowers computers to acquire intelligence and perform tasks without explicit programming. ML is used to simulate human learning activities [15] for obtaining new information and skills in order to continuously improve knowledge. ML algorithms can be broadly categorized into several types, each tailored to different learning scenarios. Supervised learning involves learning from labeled data, where the algorithm predicts output based on input-output pairs provided during training. In contrast, unsupervised learning explores unlabeled data to uncover hidden structures and patterns. Semi-supervised learning represents a hybrid approach combining elements of supervised and unsupervised learning, while reinforcement learning enables agents to learn through interactions with an environment.

2.1.1 Data preprocessing

Data encompasses the raw information, whether structured or unstructured, that is fed into a model to enable learning and decision-making. This data can come from various sources, such as databases, sensors, text documents, images, or audio recordings. However, data quality and quantity play crucial roles in determining the model's effectiveness and reliability.

Data preprocessing is often the initial step in preparing the data for training with an ML classifier. This involves tasks like cleaning the data to remove noise and inconsistencies, handling missing values, and transforming the data into a format suitable for the model [16]. Removing noise refers to the process of eliminating irrelevant or unwanted information from the dataset. Noise can manifest in various forms such as outliers, errors, or inconsistencies in the data. Removing noise is crucial because it can adversely affect the performance and accuracy of ML classifiers by introducing unnecessary variability or bias.

Additionally, features may need to be extracted or engineered to enhance the classifier's ability to learn relevant patterns and relationships within the data.
Features represent individual measurable properties or characteristics of the data used as input for an ML model. Features are essential components of the dataset that provide information to the classifier, allowing it to learn patterns and make predictions. In the end, the goal of data preprocessing is to find the best set of features for the classifier used [17]. The two most relevant techniques for this thesis are addressing data imbalance and handling outliers.

Addressing data imbalance

In classification problems, imbalanced datasets occur when one class of the target variable is significantly more prevalent than others [18]. Data cleaning may involve techniques such as resampling (oversampling the minority class or undersampling the majority class) to address data imbalance and improve the performance of the model. Oversampling involves increasing the number of instances in the minority class by randomly replicating them or generating synthetic samples. Undersampling involves reducing the number of instances in the majority class by randomly selecting a subset of instances. This can help balance the class distribution, but it may also result in loss of information. Combining oversampling and undersampling techniques can be beneficial to balance the dataset while minimizing the loss of information.

Handling outliers

Outliers are data points that deviate significantly from the rest of the dataset. Outliers can skew statistical analyses and affect the performance of ML models. Techniques for handling outliers include trimming (removing extreme values), capping (replacing extreme values with a predefined threshold) or transforming the data to be more robust to outliers. When working with log data, not all output is normal text; this can be due to numerous varied factors, whether intentional or caused by some bug in the system. These occurrences can in this case be categorized as what are known as "Error outliers" [19], which are points in a dataset that are not of interest to the population. In this case, the population would be text.

2.1.2 Tokenization

When working with strings, the data must first be tokenized before being input into a classifier. The process of tokenization is essential when machines need to understand and process human language. Tokenization involves breaking down unstructured text, which can be anything from sentences to entire documents, into smaller units called tokens [20]. These tokens could be individual words, subwords, or even characters, depending on the specific requirements of the task. For instance, when tokenizing a sentence, each word might become a token, or the words might be further segmented into characters. After tokenization, the tokens are converted into numerical representations through vocabulary building, which assigns a unique index to each token. To help with understanding how the tokenizers work, each of the different tokenizers will be accompanied by an example taken from Listing 1, "?1|20:52:35.395 [go] Task status: failed (147597 ms) (exit code: 1)", showing how this string is tokenized.

Word tokenization

The word tokenizer scans through the input text character by character, identifying patterns such as spaces, punctuation marks, and special characters that separate words or indicate word boundaries. During this process, it detects word boundaries using criteria like whitespace characters, punctuation marks, and language-specific rules.
Special cases like contractions, hyphenated words, and abbreviations are also handled. Various word tokenizers treat special cases differently, but this thesis will utilize the NLTK Treebank word tokenizer [21]. As it processes the text, the tokenizer generates a sequence of tokens, each representing a single word or a meaningful unit of text such as a special character. As can be seen in Listing 2, spaces are excluded from being tokenized; instead, they only act as separators.

{0: '(', 1: ')', 2: '1', 3: '147597', 4: '1|20:52:35.395', 5: ':', 6: '?', 7: 'Task', 8: '[', 9: ']', 10: 'code', 11: 'exit', 12: 'failed', 13: 'go', 14: 'ms', 15: 'status', 16: ''}

Listing 2: Tokenized mapping of words.

The final output array of tokenizing the input string would then look as shown in Listing 3.

[6, 4, 8, 13, 9, 7, 15, 5, 12, 0, 3, 14, 1, 0, 11, 10, 5, 2, 1]

Listing 3: Tokenized input with the word tokenizer.

Character tokenization

Character tokenization works by iterating through each character in the text, including letters, digits, punctuation marks, and whitespace. For each character encountered, the tokenizer generates a token representing that character. It does not consider word boundaries or spaces; instead, it treats each character as a separate unit. The tokenizer continues this process until it reaches the end of the input text, producing a sequence of character tokens. For the example text provided above, character tokenization would produce the mapping shown in Listing 4.

{0: ' ', 1: '(', 2: ')', 3: '.', 4: '0', 5: '1', 6: '2', 7: '3', 8: '4', 9: '5', 10: '7', 11: '9', 12: ':', 13: '?', 14: 'T', 15: '[', 16: ']', 17: 'a', 18: 'c', 19: 'd', 20: 'e', 21: 'f', 22: 'g', 23: 'i', 24: 'k', 25: 'l', 26: 'm', 27: 'o', 28: 's', 29: 't', 30: 'u', 31: 'x', 32: '|', 33: ''}

Listing 4: Tokenized mapping of characters.

The final output array of tokenizing the input string would then look as shown in Listing 5.

[13, 5, 32, 6, 4, 12, 9, 6, 12, 7, 9, 3, 7, 11, 9, 0, 15, 22, 27, 16, 0, 14, 17, 28, 24, 0, 28, 29, 17, 29, 30, 28, 12, 0, 21, 17, 23, 25, 20, 19, 0, 1, 5, 8, 10, 9, 11, 10, 0, 26, 28, 2, 0, 1, 20, 31, 23, 29, 0, 18, 27, 19, 20, 12, 0, 5, 2]

Listing 5: Tokenized input with the character tokenizer.

Byte-Pair-encoding tokenization

Byte Pair Encoding (BPE) tokenization begins with initializing a vocabulary containing all the characters or symbols present in the corpus. Next, the algorithm iterates over the corpus and identifies the most frequent pair of adjacent characters or character sequences. It merges the most frequent pair into a new symbol and updates the vocabulary accordingly. This process continues for a specified number of iterations or until a certain vocabulary size is reached. As the iterations progress, the algorithm gradually builds a vocabulary of subword units that represent frequently occurring character sequences in the corpus. During tokenization, the input text is segmented into subword units based on the vocabulary learned during training. The tokenizer then replaces rare or out-of-vocabulary words with a combination of subword units that are present in the vocabulary. BPE tokenization is effective for handling rare words, morphologically rich languages, and out-of-vocabulary terms by breaking them down into smaller, more manageable subword units.
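To make the merge loop described above concrete, the following is a minimal Python sketch of BPE vocabulary building; the toy corpus words and the number of merges are illustrative assumptions, not values used in the thesis.

from collections import Counter

def bpe_merges(corpus_words, num_merges):
    # Each word starts as a tuple of single characters.
    words = Counter(tuple(word) for word in corpus_words)
    merges = []
    for _ in range(num_merges):
        # Count every pair of adjacent symbols across the corpus.
        pairs = Counter()
        for symbols, freq in words.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Replace every occurrence of the best pair with one merged symbol.
        merged = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        words = merged
    return merges

# Example: 'f'+'a', then 'fa'+'i', then 'fai'+'l' are merged first.
print(bpe_merges(["failed", "failed", "fail", "exit"], num_merges=3))

With a learned merge list like this, a new word is tokenized by applying the merges in order, which is how the pre-trained example below works at a much larger scale.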
Using the ChatGPT4 tokenizer, which has already been trained on a full corpus, tokenizing the sentence provided above results in the mapping shown in Listing 6.

{30: '?', 16: '1', 91: '|', 508: '20', 25: ':', 4103: '52', 1758: '35', 13: '.', 19498: '395', 510: ' [', 3427: 'go', 60: ']', 5546: ' Task', 2704: ' status', 4745: ' failed', 320: ' (', 10288: '147', 24574: '597', 10030: ' ms', 8: ')', 13966: 'exit', 2082: ' code', 220: ' '}

Listing 6: Tokenized mapping of BPE using ChatGPT4 corpus.

The final output array would then look as shown in Listing 7.

[30, 16, 91, 508, 25, 4103, 25, 1758, 13, 19498, 510, 3427, 60, 5546, 2704, 25, 4745, 320, 10288, 24574, 10030, 8, 320, 13966, 2082, 25, 220, 16, 8]

Listing 7: Tokenized input with the GPT4 tokenizer.

The mapping produced by the GPT4 tokenizer can be compared to one built without an already existing full corpus, which would look as it does in Listing 8. Here the full corpus is only the log data that is provided.

{1: '?1|20:52:35.395', 2: '[go]', 3: 'Task', 4: 'status:', 5: 'failed', 6: '(147597': 1, 'ms)', 7: '(exit', 8: 'code:', 9: '1)'}

Listing 8: Tokenized mapping of BPE using corpus from the provided text.

As the mapping in Listing 8 does not have the same context as ChatGPT4, the tokenizer tries to merge pairs of characters; in this case, most characters appear next to each other only once within a word.

LabelEncoder

The LabelEncoder is a predefined module within the Python package Scikit-learn, accessible through the pip library. It accepts input data in the form of an array and outputs corresponding tokens. Unlike other tokenizers utilized in this thesis, the LabelEncoder offers versatility by effectively serving as a substitute for various tokenization methods. For instance, the character tokenizer's functionality can be replicated by segmenting the log into an array of characters and feeding it into the LabelEncoder. The tokenized mapping of the string would then look like Listing 4, and the output would be as shown in Listing 5. Without first segmenting the log into characters, the tokenizer would instead produce the output shown in Listing 9.

{'?1|20:52:35.395 [go] Task status: failed (147597 ms) (exit code: 1)': 0}

Listing 9: Tokenized mapping of the entire input text.

The final output array of tokenizing the input string would then look as shown in Listing 10.

[0]

Listing 10: Tokenized input with the LabelEncoder.

The reason for this behavior is that the LabelEncoder necessitates manual division of the text into smaller inputs by the user. With proper segmentation, it can perform the same function as other tokenizers. For instance, if the text is divided into two strings within an array beforehand, the resulting output array would resemble the depiction shown in Listing 11.

[0, 1]

Listing 11: Tokenized output from LabelEncoder when two strings are sent in one array as input.

2.1.3 Training

Once the data is preprocessed and tokenized, it is divided into training and test sets. The training dataset is used to teach the classifier to recognize patterns [22] and make predictions based on the input data. The test dataset is employed to fine-tune the classifier's hyperparameters and assess its performance during training, helping to prevent overfitting, where the classifier memorizes the training data but fails to generalize to new data.
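A minimal sketch of this split (assuming scikit-learn; the dummy samples, labels, and 80/20 split are illustrative choices, not the thesis's actual configuration):

from sklearn.model_selection import train_test_split

# X: tokenized log samples, y: job outcome (1 = successful, 0 = unsuccessful).
X = [[i, i + 1, i + 2] for i in range(10)]   # ten dummy tokenized samples
y = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]           # imbalanced outcome labels

# Hold out 20% of the samples as a test set; stratify keeps the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(len(X_train), len(X_test))  # 8 2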
Training an ML classifier involves feeding the preprocessed data into the classifier and iteratively adjusting its parameters to minimize the difference between the predicted outputs and the actual outputs. Other options besides hyperparameter tuning are to change how the preprocessed data is structured. This can be done by removing unwanted features or using several types of tokenization. This optimization process typically involves using algorithms such as gradient descent to update the classifier's parameters based on the computed loss or error.

The algorithm starts at a random point with an initial set of parameters [23]. In the example provided in Figure 2.1, the starting point is at a cost of 10000. At that point, the gradient of the terrain is calculated, which tells the algorithm the direction of the steepest slope. In ML, this gradient represents the direction of the steepest increase in the loss function; in the case of Figure 2.1, the descent is towards 0 from the starting point at 100. A loss function, in the context of ML and optimization algorithms, is a mathematical function that measures the discrepancy between the predicted values of a classifier and the actual ground-truth values. It essentially quantifies how well the model is performing on a particular task. The goal of training a classifier is to minimize this loss function, thereby improving the classifier's ability to make accurate predictions. Once the algorithm has found the direction of the steepest slope, it takes a small step in the opposite direction. This step size is called the learning rate, and it determines how far it is possible to move in each iteration; each step corresponds to a red dot in Figure 2.1. The current position is then updated to the new position that has been reached after taking the step downhill. This process is repeated iteratively [23], recalculating the gradient, taking a step downhill, and updating the position until a point is reached where the gradient is close to zero or until a predefined number of iterations has been reached. In Figure 2.1, the number of iterations was reached before the algorithm could reach zero.

Figure 2.1: Example of how the gradient descent algorithm tries to reach zero.

The training process is iterative and computationally intensive, especially for complex models or large datasets. It often requires significant computational resources, such as powerful CPUs or GPUs, to expedite the training process. Additionally, techniques like parallel processing and distributed computing may be employed to accelerate training and handle large volumes of data efficiently. Having ample data necessitates having access to an abundant amount of memory, since the model remains stored in memory throughout the training process.

Throughout the training process, monitoring and optimization are essential to ensure that the model converges to a satisfactory solution and generalizes well to unseen data. This may involve monitoring performance metrics, adjusting hyperparameters, and incorporating feedback from the test set to fine-tune the model's architecture and parameters.

2.1.4 Overfitting and Underfitting

When the classifier is evaluated on previously unseen data, overfitting and underfitting can become apparent. Overfitting and underfitting are fundamental concepts in ML that relate to how well a classifier learns from training data and generalizes to new, unseen data.
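Before turning to overfitting and underfitting, the gradient descent loop described above can be sketched in a few lines of Python; the quadratic loss, learning rate, and starting point below are illustrative choices and do not correspond to the values in Figure 2.1.

def gradient_descent(start, learning_rate=0.1, iterations=50):
    """Minimize the toy loss f(x) = x**2, whose gradient is 2*x."""
    x = start
    for _ in range(iterations):
        gradient = 2 * x                  # direction of steepest increase of the loss
        x = x - learning_rate * gradient  # take a small step in the opposite direction
    return x

# Starting far from the minimum, the position converges towards 0.
print(gradient_descent(start=100.0))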
Overfitting

Overfitting occurs when an ML classifier learns the training data too well [24], including noise and random fluctuations in the data, to the extent that it negatively impacts the classifier's ability to generalize to new, unseen data. Essentially, the classifier memorizes the training data rather than learning the underlying patterns. As a result, an overfitted classifier performs very well on the training data but poorly on new, unseen data. Signs of overfitting include a high accuracy on the training dataset but a significantly lower accuracy on the validation or test datasets. Overfitting often happens when the classifier is too complex relative to the amount and quality of the training data, or when the classifier is trained for too many iterations. To address overfitting, many different techniques can be used; these are some of them:

Cross-validation: In certain scenarios, the ML classifier used can show very good accuracy on one part of a dataset but may not generalize to other parts of the dataset [25]. Cross-validation involves partitioning the dataset into multiple subsets, known as folds. The classifier is trained on a portion of the data and validated on the remaining folds. This process is repeated multiple times, with each fold serving as both the training and test set in turn. Performance metrics obtained from each iteration are then averaged to provide a more robust estimate of the classifier's performance.

Training with More Data: Increasing the size of the training dataset can help the classifier learn the underlying patterns better and reduce overfitting [24]. If obtaining more data is feasible, it is often one of the most effective ways to mitigate overfitting. This however comes with the tradeoff that more computing power will be needed, as the ML classifier needs to fit the new data.

Underfitting

Underfitting, on the other hand, occurs when an ML classifier is too simple to capture the underlying structure of the data. In this case, the classifier fails to learn the patterns present in the training data and performs poorly on both the training and unseen data [24]. Underfitting can happen for assorted reasons, such as using a classifier that is too simple, not providing enough training data, or insufficient training time. Signs of underfitting include low accuracy on both the training and validation/test datasets. There are many different strategies to mitigate underfitting, and the ones this thesis will cover are:

Increase Classifier Complexity: Underfitting happens when the classifier is too simple to capture the underlying patterns in the data. One way to try and solve this problem is by increasing the complexity of the model by adding more layers (only for neural networks), increasing the number of parameters, or using a more complex algorithm.

Feature Engineering: Feature engineering revolves around the creation, selection, and transformation of features from the original dataset to facilitate better classifier learning and prediction [26]. Effective feature engineering entails extracting relevant information, reducing dimensionality, and encoding domain knowledge into the feature space, thereby enabling classifiers to capture underlying patterns more effectively. One way of extracting features from text is the use of different tokenizers, which transform the text in diverse ways.
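As a minimal illustration of turning raw log text into numeric features (assuming scikit-learn; the example log lines and the character-level analyzer are illustrative assumptions, not the thesis's actual feature pipeline):

from sklearn.feature_extraction.text import CountVectorizer

# Two toy log lines; each becomes a fixed-length vector of character counts.
log_lines = [
    "?1|20:52:35.395 [go] Task status: failed (exit code: 1)",
    "?1|20:53:01.120 [go] Task status: passed (exit code: 0)",
]

# analyzer="char" counts single characters, similar to the character tokenizer.
vectorizer = CountVectorizer(analyzer="char")
features = vectorizer.fit_transform(log_lines)

print(vectorizer.get_feature_names_out())  # the learned character vocabulary
print(features.toarray())                  # one row of counts per log line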
Hyperparameter Tuning: Hyperparameters are parameters that govern the behavior and performance of ML algorithms, distinct from classifier parameters that are learned from the training data [27]. Hyperparameter tuning involves systematically exploring and selecting optimal values for these parameters to enhance classifier performance and mitigate issues such as overfitting and underfitting. In order to more easily find which parameters are most suited for the use case, several algorithms can be used to find the best fit. The two most popular algorithms are RandomizedSearchCV and GridSearchCV.

GridSearchCV embodies the concept of hyperparameter tuning by exploring a specified grid of hyperparameter values for a given ML algorithm. It systematically evaluates the performance of the classifier for all combinations of hyperparameters using cross-validation. On the other hand, RandomizedSearchCV operates on the premise of hyperparameter optimization by randomly sampling hyperparameter values from specified distributions or ranges. It systematically evaluates the performance of the classifier for each sampled configuration using cross-validation. The performance difference between them can be quite large, while the difference in accuracy is not that big [27]. The search method of choice will therefore be RandomizedSearchCV when evaluating different jobs and how the parameters affect the result. GridSearchCV will, however, be used in order to get a baseline for the deviation between the two search methods for this thesis.

2.1.5 Supervised learning

Supervised learning is a type of ML paradigm where the classifier learns from labeled data, meaning each input in the dataset is associated with a corresponding output or target variable [28]. The goal of supervised learning is to learn a mapping from input variables to output variables based on the labeled training data provided. This is shown by the mathematical function in Equation 2.1,

f : x → y (2.1)

where the data input to and output from the function in Equation 2.1 is formatted as shown in Equation 2.2.

{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} (2.2)

In supervised learning, the algorithm learns from examples, where it is presented with input-output pairs and adjusts its internal parameters to minimize the difference between the predicted output and the actual output. In Equation 2.2, the input is x and the corresponding output is y. For CI classification, y is the status of the finished job, 1 for a successful job and 0 for an unsuccessful job. Parameter x is the input to the classifier; in the case of the CI job this can be the log data, files edited, commit author, etc.

In classification tasks, the output variable is a categorical value or class label. The goal is to classify input data into predefined categories or classes. Examples of classification tasks include spam detection in emails, sentiment analysis in text data, and image classification.

2.1.6 Evaluation metrics

Researchers usually resort to using commonly accepted performance metrics while evaluating the classifier [29]. Some common ones are accuracy, area under the curve (AUC), area under the ROC (receiver operating characteristic) curve, and F-measure. Although these will be used in the thesis for evaluating the work of other studies, Matthews Correlation Coefficient (MCC) and Dynamic Time Warping (DTW) are used for evaluation of the results from this thesis.
The F1 score was not used in the thesis as it focuses solely on the positive class and does not take into account the true negatives. This can be a limitation in scenarios where the performance on the negative class is also important, as in this thesis, where the purpose is to be able to successfully predict CI builds that are about to fail. ROC was not used as it tells how well the model ranks positive instances relative to negative ones, but not about the actual decision boundaries. It is also not particularly useful for imbalanced classes, hence the decision to use MCC instead, which is also why AUC was not selected.

Dynamic Time Warping

Dynamic Time Warping (DTW) is a technique used in the field of time series analysis to measure the similarity between two sequences that may vary in time or speed. It is particularly useful when comparing sequences that may have temporal distortions, such as varying speeds, shifts, or nonlinear distortions. DTW finds an optimal alignment between the two sequences by warping the time axis, allowing for the comparison of corresponding points in the sequences, even if they occur at different times.

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is a metric used to evaluate the performance of classification algorithms, particularly in the context of binary classification tasks. It takes into account true positives, true negatives, false positives, and false negatives to provide a balanced measure of a classifier's performance, even in cases of class imbalance. MCC ranges from -1 to +1, and the score can be interpreted as follows:

• A score of +1 indicates perfect prediction, where there are no false positives or false negatives.
• A score of -1 indicates perfect disagreement between prediction and observation.
• A score of 0 indicates random prediction.

2.2 Traditional machine learning

Traditional machine learning encompasses a collection of algorithms and methodologies employed to construct models capable of discerning patterns and generating predictions from data. These classifiers possess the ability to autonomously learn from data and make predictions or decisions without the need for explicit programming tailored to these tasks [30]. This section provides an overview of the inner workings of two classifiers, Random Forest (RF) and Gradient Boosting (GB).

2.2.1 Random forest classifier

Random Forest (RF) operates by constructing an ensemble of decision trees during the training phase. Each decision tree is built on a random subset of the training data, employing bootstrapping (bagging) techniques to introduce diversity among the trees [31]. Moreover, at each node of the decision tree, a random subset of features is considered for splitting, enhancing the robustness of the ensemble.

A decision tree is built by asking questions about distinctive features. At the start, all data is in one big box, which is the root node. Then questions are asked to split the data into smaller groups based on the answers. These questions are chosen to maximize the information gain or minimize impurity at each step. As more questions are asked and data is split, branches are created that lead to more specific groups of data. Eventually, the tree ends up with smaller boxes, or leaf nodes, where no more questions are asked because the stopping point has been reached. In classification, each leaf node represents a class, and in regression, it represents a predicted value.
When a prediction is made for a new data point, the algorithm starts at the root node and follows the branches based on the answers to the questions until a leaf node is reached. In classification, the majority class in that leaf node is the prediction; in regression, it is the average of the target values.

Decision trees are simple and interpretable: one can easily visualize the decision-making process and understand which features are most important for making predictions. However, decision trees can also be prone to overfitting, especially if they grow too deep and capture noise in the data instead of true patterns. To mitigate this overfitting, multiple trees are grouped into a forest.

During the prediction phase, the ensemble of decision trees collectively contributes to the final output. For classification tasks, RF employs a majority voting mechanism, while for regression tasks it averages the outputs of the individual trees. This aggregation strategy ensures robust and accurate predictions across diverse datasets.

The strengths of RF lie in its ability to mitigate overfitting [31], handle high-dimensional datasets, and provide insights into feature importance. However, it may exhibit computational overhead, particularly with large datasets, and may not perform optimally in the presence of noise or outliers. Additionally, the interpretability of RF classifiers may vary depending on the context and complexity of the data.

As noted by Nasir et al. [32], the RF algorithm excels in both classification and regression tasks, making it an excellent choice for analyzing log data, especially textual logs, in this thesis. Its appeal lies in its capability to handle high-dimensional data efficiently while delivering strong performance. Moreover, RF is adept at managing imbalanced datasets, a common scenario in CI job logs, where failed logs contain information not typically present in successful CI job runs.

2.2.2 Gradient boosting
Gradient Boosting (GB) is an ML technique built on the cooperation of several classifiers [33]. GB starts with a basic understanding of the problem, in the form of a simple initial classifier. This initial classifier might not be very accurate, but it provides a starting point. GB then looks at where it is making mistakes.

Next, it uses another simple classifier, such as a decision tree, that is good at correcting the mistakes made by the initial classifier. This new classifier focuses on the areas where the first classifier went wrong and helps improve the predictions [33]. The ensemble is adjusted based on the feedback from this new model, meaning it gradually gets better at solving the problem. The process is then repeated, bringing in more classifiers one by one, each focusing on different aspects of the problem and refining the predictions further. With each iteration, the errors are reduced and the ensemble gets closer to the correct solution.

However, GB tries not to rely too heavily on any single classifier, since doing so could lead to worse predictions. It therefore uses a concept called the "learning rate", which controls how much weight each classifier's prediction is given [34]. This helps prevent over-reliance on any single classifier and keeps the predictions balanced and accurate. The process is repeated until GB is satisfied with the predictions or has reached a predetermined level of accuracy.
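A minimal sketch of this idea, again on synthetic data rather than the thesis's CI logs, shows how the number of sequential trees and the learning rate appear as explicit parameters:

```python
# Minimal sketch (assumed toy data) of gradient boosting: many shallow trees added
# one by one, each correcting the errors of the ensemble built so far, with the
# learning rate damping how strongly each new tree's correction is weighted.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gb = GradientBoostingClassifier(
    n_estimators=200,    # number of sequential correcting trees
    learning_rate=0.05,  # weight given to each new tree's correction
    max_depth=3,         # each individual tree stays simple
    random_state=0,
)
gb.fit(X_train, y_train)
print(gb.score(X_test, y_test))
```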
Finally, all predictions are combined, weighted according to their prediction quality, to arrive at the final prediction.

2.3 Deep learning
Deep learning (DL) is a subset of ML that focuses on using neural networks with multiple layers to learn representations of data [35]. Unlike traditional ML classifiers, which may require manual feature extraction, DL classifiers automatically learn hierarchical representations of data directly from the raw input.

At the core of DL are neural networks, computational models inspired by the structure and function of the human brain rather than the traditional zeroes and ones used in conventional computing tasks [36]. These networks consist of interconnected layers of nodes (neurons), where each node performs a simple mathematical operation on its inputs and passes the result to the next layer. Neural networks typically consist of an input layer, one or more hidden layers, and an output layer [36]. Each layer is composed of multiple nodes, and the connections between nodes have associated weights that are adjusted during the training process.

2.3.1 Layers
The input layer is the initial layer of a neural network; it receives input data and passes it to the subsequent layers for processing. It consists of neurons that represent the input features or dimensions of the data, so the number of neurons in the input layer is determined by the dimensionality of the input data. For example, when working with images where each pixel is a feature, the number of neurons in the input layer equals the total number of pixels in the image. The input layer does not perform any computation itself; its primary function is to pass the input data on to the subsequent layers, typically hidden layers, and eventually to the output layer.

A hidden layer is a layer in a neural network that sits between the input layer and the output layer. The term "hidden" implies that these layers are not directly observable from the input or output data; they are intermediary layers where the neural network learns to represent features or patterns in the input data.

The output layer is the final layer of a neural network architecture. It is responsible for producing the desired outputs or predictions based on the input data and the learned parameters of the network. The structure and configuration of the output layer depend on the type of task the neural network is designed to solve.

During training, the dataset is divided into batches of a predefined size, each containing a subset of the training data. DL processes data in batches, and the classifier learns to adjust the weights of the connections between nodes in order to minimize the difference between the predicted output and the actual output for a given input. This process, known as backpropagation [37], involves iteratively adjusting the weights using optimization algorithms such as stochastic gradient descent. Once all the data has been iterated through, one epoch is completed. After completing one epoch, the training process may continue with additional epochs if needed to further refine the classifier's performance. Each epoch provides an opportunity for the classifier to learn from the entire dataset and gradually improve its performance.

The depth of a DL classifier refers to the number of hidden layers in the neural network.
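A minimal Keras sketch of this layer structure (here with two hidden layers, i.e. a depth of two) and of batch/epoch training follows; the framework, feature count, and random data are illustrative assumptions, not the thesis's actual setup:

```python
# Minimal sketch (random data, assumed framework) of input, hidden and output layers
# and of training in batches over several epochs.
import numpy as np
from tensorflow import keras

X = np.random.rand(1000, 50)             # 1000 samples, 50 input features
y = np.random.randint(0, 2, size=1000)   # binary pass/fail labels

model = keras.Sequential([
    keras.Input(shape=(50,)),                      # input layer: one neuron per feature
    keras.layers.Dense(32, activation="relu"),     # first hidden layer
    keras.layers.Dense(16, activation="relu"),     # second hidden layer
    keras.layers.Dense(1, activation="sigmoid"),   # output layer: probability of success
])
model.compile(optimizer="sgd", loss="binary_crossentropy", metrics=["accuracy"])

# batch_size controls how many samples are processed before the weights are updated
# via backpropagation; one pass over all batches completes one epoch.
model.fit(X, y, batch_size=32, epochs=5)
```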
Deeper networks have the capability to learn more complex data represen- tations but also require more computational resources and can be more challenging to train. 2.3.2 Activation functions An activation function is a mathematical operation applied to the output of each neuron in a neural network. It serves as a gate, determining whether the neuron should be activated or not, based on whether the neuron’s input meets a certain threshold. Activation functions introduce non-linearity, enabling the network to learn complex patterns and relationships within the text data. Fully connected lay- ers follow the convolutional layers, allowing the neural network to perform higher- level reasoning and decision-making based on the extracted features. These layers connect every neuron from one layer to every neuron in the next, facilitating a com- prehensive analysis of the text. ReLU The ReLU activation function has previously been shown to improve DL neural 19 2. Background networks [38]. The function works by outputting the max value of a function or 0. The ReLU function is preferred over other activation functions like sigmoid and tanh because it helps mitigate the vanishing gradient problem [39], which can occur during the training of deep neural networks. Additionally, ReLU tends to be com- putationally more efficient compared to some other activation functions. Sigmoid The Sigmoid function maps any real-valued number to a value between 0 and 1 [40]. It has an S-shaped curve, with an output close to 0 for large negative inputs, close to 1 for large positive inputs, and approximately 0.5 at input value = 0. This prop- erty makes it useful for tasks when trying to model probabilities or create binary classifications, as it can squash the output of a linear function into the range [0, 1], thus providing a probability-like interpretation. Sigmoid functions are less favored now due to issues like vanishing gradients, where the gradient becomes extremely small for very large or very small inputs, hindering effective learning during back- propagation. Tanh The Tanh function takes any real number as input and outputs a value between -1 and 1. It is essentially a rescaled version of the Sigmoid function, stretching from -1 to 1 instead of 0 to 1. Tanh has many of the same characteristics as the Sigmoid function in that it has an S-shape and that the function asymptotically approaches -1 as the input value approaches negative infinity and approaches 1 as the input value approaches positive infinity. In neural networks, the Tanh function is often used as an activation function in hidden layers because it is zero-centered and tends to make learning easier compared to the Sigmoid function, which suffers from the vanishing gradient problem. The zero-centeredness helps in mitigating issues related to vanishing gradients, making optimization more efficient. 2.3.3 Convolution Neural Network Convolutional Neural Networks (CNNs) are a specialized type of deep neural net- work architecture commonly applied in text-based classification tasks [41]. Unlike the traditional use of CNNs in image processing, CNNs for text analysis operate over one-dimensional sequences of words or characters instead of two dimensions as is the case with image processing. In text based CNNs, the convolutional layers apply filters of varying sizes [42] to the input text, capturing local patterns and features. Each filter will try to capture dif- ferent patterns in the input text to get a clearer picture when moving to subsequent layers. 
The output of applying a filter to the input data is called a feature map. Each filter produces one feature map, and the number of filters in a convolutional layer determines the depth of the output volume. The depth corresponds to the number of filters or feature maps. Convolutional layers are employed to extract local features or patterns from text 20 2. Background sequences that it trains the classifier on. One-way filters are used on text to show how a sentence is structured; this can be done by checking what tokens (tokenized sentence) are often beside each other. With the sequence from Listing 3 the convo- lutional neural network would learn that there is a pattern of a non-predetermined number of words and then a space in between. This is what is typically called a word in normal English. The tokens from listing 3 inputted into a CNN can be seen in Figure 2.2. In order to learn the pattern an operation called convolution operation is involved. It involves sliding a small filter (also known as a kernel or feature detector) over the input data and computing the dot product between the filter and the overlapping region of the input. In Figure 2.2 this is shown as the input layer to the bottom of the figure converges to the middle layer in the figure. In this case the kernel size of 2 means two values from the bottom layer converges to one value in the middle layer, as is shown by the first two values 5 and 8 both having lines drawn to the leftmost box in the middle layer. The result of this dot product is a single value in the output, referred to as the "feature map". The value for sliding can be controlled by setting the stride parameter that de- termines the step size at which the filter moves across the input data. A stride of 1 means the filter moves one token at a time, while a stride of 2 means it moves two tokens at a time. Larger strides result in smaller output volumes. In the case of Figure 2.2 a stride value of 1 means that the leftmost numbers 5 and 8 is computed first, 8 and 6 are thereafter computed. With a stride of 2 the values 6 and 11 would instead be computed after 5 and 8. After computing the dot product between the filter and the input, a non-linear activation function is typically applied element-wise to the resulting values in the feature map. After convolution, pooling layers are often used to reduce the spatial dimensions of the feature maps while retaining important information. One such pooling tech- nique is Max pooling and divides the input to it into a grid of non-overlapping regions and for each region takes the maximum value. This helps ensure the net- work is more computationally efficient and helps reduce overfitting. In Figure 2.2 the last step in the figure is pooling where the three inputs from the middle layer become one. With max pooling the largest of the three inputs would be the result, while the other two inputs would be discarded. Figure 2.2: CNN network. CNNs for text classification are trained using backpropagation, adjusting the weights 21 2. Background of the network to minimize the difference between predicted and actual class labels. Regularization techniques such as dropout and weight decay are often employed to prevent overfitting during training. Transfer learning can also be leveraged in text based CNNs, where pre-trained models trained on large text corpora are fine-tuned for specific classification tasks with smaller datasets. 
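Before turning to recurrent architectures, the components described above (embedding, filters with a given kernel size and stride, max pooling, and a fully connected output) can be pulled together in a minimal sketch. The token IDs, vocabulary size, and layer sizes are hypothetical and do not reflect the thesis's tokenizer or model:

```python
# Minimal sketch (hypothetical tokenized logs) of a 1-D convolutional text classifier.
import numpy as np
from tensorflow import keras

vocab_size, seq_len = 1000, 100
X = np.random.randint(0, vocab_size, size=(256, seq_len))  # tokenized log lines
y = np.random.randint(0, 2, size=256)                      # pass/fail labels

model = keras.Sequential([
    keras.Input(shape=(seq_len,)),
    keras.layers.Embedding(vocab_size, 32),                   # token id -> vector
    keras.layers.Conv1D(filters=64, kernel_size=2, strides=1,  # 64 feature maps over
                        activation="relu"),                    # windows of 2 tokens
    keras.layers.GlobalMaxPooling1D(),                        # max pooling over time
    keras.layers.Dense(1, activation="sigmoid"),               # classification head
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=3, batch_size=32)
```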
As noted above, such transfer learning enables knowledge learned by the pre-trained models to be reused, enhancing both performance and efficiency.

2.3.4 Long short-term memory
Long Short-Term Memory (LSTM) networks are a type of recurrent neural network (RNN) architecture. An RNN is a neural network architecture tailored to processing sequential data. The network maintains an internal state, or memory, enabling it to capture dependencies between elements within a sequence; how this is done can be seen in Figure 2.3. One problem with RNNs is the vanishing gradient problem [43], which occurs when the gradients used to update the weights of the network during training become extremely small as they are propagated backward through the network layers.

Figure 2.3: Basic RNN structure.

LSTM is specifically designed to address the issues of capturing long-term dependencies and avoiding the vanishing gradient problem that often arises in traditional RNNs [44]. At the core of an LSTM unit is the concept of a "cell state", which acts as the memory of the network. This cell state can maintain information over long sequences, enabling the network to retain context and dependencies over time. LSTM units have three main components: the input gate, the forget gate, and the output gate [44]. Each gate is essentially a set of mathematical operations driven by neural network layers, particularly sigmoid and tanh layers.

At each time step, the LSTM decides what information to discard from the cell state using the forget gate. This gate considers the previous cell state and the current input and outputs a value between 0 and 1 for each component of the cell state, where 0 means "completely forget" and 1 means "completely remember". Simultaneously, the input gate determines what new information to store in the cell state. It involves two stages: a sigmoid layer that decides which values to update, and a tanh layer that creates a vector of new candidate values that could be added to the cell state. The forget gate and the input gate work together to update the cell state: the forget gate decides what information should be removed or retained, while the input gate decides what new information should be added. Finally, the output gate controls what information from the cell state should be output at the current time step. This gate filters the cell state through a sigmoid layer and produces the final output of the LSTM unit.

3 Related work
Several prior studies have explored the feasibility of anticipating the outcome of a CI job before its execution. These studies consider various parameters, including modified files, edited lines of code, commit messages, and more, to predict the outcome. Beyond prediction of CI jobs, there is research on other types of log prediction that could potentially be used to predict CI jobs as well. Moreover, others have investigated how predictions can be used not only in CI but also in other areas related to improving developers' workflow.

3.1 Traditional machine learning in CI
Saidani et al. suggest employing an evolutionary search algorithm [45], particularly the Non-dominated Sorting Genetic Algorithm (NSGA-II), in their research on cutting CI job build time. Furthermore, they wanted to provide developers with support tools for setting up rules characteristic of their CI system.
This algorithm operates on predefined rules that developers manually add for the various pipelines. These rules may encompass factors such as the number of changed files, the magnitude of changes, and so forth. Their best result was an AUC score of 0.78 for correctly labeling data. Although the algorithm shows promise, this thesis focuses on making predictions from CI logs while the CI job is building, so it would not be a feasible algorithm to use here.

Xia et al. [46] examined whether it is possible to predict if a CI build will succeed, given the results of 126 different projects containing almost 300,000 previous builds. They employed nine distinct classifiers, including decision tree, GB, logistic regression, multilayer perceptron, multinomial Naive Bayes, nearest centroid, nearest neighbors, RF, and ridge regression, to find the most effective model. However, the findings were not encouraging: none of the classifiers achieved an F1 score exceeding 0.6, which the authors regard as subpar performance. The AUC score fluctuated more, with scores as high as 0.8 and as low as 0.5. Notably, RF achieved the highest F1 score among the classifiers tested in 40% of all projects. In terms of AUC, GB showed the most promise, leading in 75% of all projects assessed by the authors. Although the results were not promising, as noted by Xia et al., the classifiers were applied before any CI job had started; the results could therefore improve if the approach were used while the CI job is running.

Several other studies have also looked at other ways traditional ML can be used to help with code integration. One such study by E. Knauss et al. [47] looked at how to minimize CI build time instead of predicting the build outcome. They researched how ML can be used to select which test cases to run in a CI test suite, with the goal of reducing the time tests take in the CI build process. The method used a statistical model that connects changed fragments of code, also called code churns, with test results. The results show that the approach could in the best case provide a 3.71-times speedup compared to running all test cases; in the worst case, a 1.1-times speedup was achieved. This was in the context of a single company, so the effect could vary between companies, but it still shows that there are other ways of improving feedback time to developers without CI build outcome prediction.

Another way to find failures before a CI build even starts is through code reviews. M. Staron et al. [48] identify several problems with code reviews. Firstly, humans tend to focus on the wrong things when reviewing code, which they argue can be addressed by filtering out commits that do not require manual review. Secondly, reviews can slow down the integration process when long discussions lead to breaks that disrupt the workflow. They also note that configuration files are often part of the review but can be hard for humans to read. One possible way to address these issues is to automate the code review workflow, although full automation is not desired as it can, for example, hinder knowledge transfer [48]. To determine which review comments are actually needed, the researchers classified comments as positive or negative, with correct classification in 72% of cases and incorrect classification in 28%.
These two papers show that not only prediction of the CI build outcome can help the integration process, but also other measures such as reducing CI build time through test case selection and making code reviews easier.

3.2 Deep learning in CI
Numerous studies have explored the application of DL to predicting CI build outcome. Saidani et al. [49] conducted a study to evaluate the efficacy of deep learning using the same Travis CI dataset employed by Xia et al. [46]. Unlike previous research that employed various DL algorithms, this study focused solely on the LSTM classifier, complemented by a genetic algorithm for hyperparameter tuning. The findings revealed a more promising outcome, achieving an AUC score of 80% in certain scenarios, though results varied significantly, sometimes reaching as low as 50%.

The authors highlighted the considerable computational expense of the algorithm but also acknowledged the potential for further optimization. The outcomes exhibited significant variation when applied to the Travis CI dataset; however, given its favorable findings, leveraging LSTM for near-real-time log analysis could prove advantageous.

DL classifiers are often trained on the structure of successful CI job logs compared to failed job logs. Failed CI job logs frequently contain anomalies when compared to successful CI jobs, rendering algorithms proficient in anomaly detection suitable for CI job log classification. Siyang et al. [50] explored three distinct DL classifiers for anomaly detection within a Hadoop Distributed File System housing extensive logs exceeding 1 GB. The evaluation showed a Convolutional Neural Network and a Multilayer Perceptron both achieving an impressive accuracy, F1 score, and AUC of 99%, whereas LSTM scored 95%.

While this thesis diverges from that study's focus, it is noteworthy that CNN significantly outperformed LSTM. The authors emphasize CNN's suitability for classification tasks like the one in the study, as it addresses overfitting by capturing local semantic information rather than global data. Two key reasons underpin CNN's superiority compared to MLP for CI job runs.

Contextual learning in CNN: CNN models incorporate both horizontal embedding codes and longitudinal entries in logs through 2-dimensional convolution operations. This means that CNNs can learn the correlations between various log entries more effectively than MLPs. MLPs, on the other hand, have a simpler and faster training procedure but do not utilize contextual information as effectively as CNNs.

Relationships in log context with CNN: The CNN classifier can extract more relationships within the log context by leveraging multiple filters. Each line of parsed data belongs to a unique identifier group and is structured in a sorted order based on identifiers related to log events. This structured data presentation allows CNNs to extract meaningful relationships, such as the sequence of log events (e.g., "Job start" before "Job is running"). The CNN's convolutional layers can effectively extract these related features, leading to improved performance compared to MLPs.

3.3 Just-in time defect prediction
Kamei et al. [51] discuss three major problems with CI build outcome predictions. The first issue arises when predictions are made at a high level, leaving developers uncertain about the causes of the predicted outcomes.
Secondly, if a particular file is predicted to contain a bug, it is essential to contact an expert responsible for that file. However, in some cases, more than 100 developers may be working on the same file, making it hard to know who to contact. Lastly, predictions are often made too late in the cycle, as developers aim to detect bugs as quickly as possible. In contrast to the approaches prevalent at the time of publishing, Kamei et al. [51] 27 3. Related work propose a new approach called just-in-time, which aims to prompt action at an ear- lier stage than previous methods. When employing the just-in-time paradigm, it can be triggered immediately upon a commit being made to a repository. Just-in-time refers to a principle or methodology that emphasizes delivering necessary resources or actions precisely when they are needed during the development process. By using just-in-time, the authors believe they address the three main problems. Firstly, the first issue is resolved by searching for each new detail in the commit instead of at the package/file level, resulting in smaller changes for developers to analyze. Sec- ondly, each new commit has only one person attached to it, meaning that person is responsible for fixing it. Lastly, immediate feedback is provided as each commit is checked upon being committed to a repository. To test the new approach, Kamei et al. [51] used a logistic regression algorithm where predictions above 0.5 were classified as defects, while others were classified as non-defects. The results of the study showed that only 20% of all bugs were found, leading to a poor result according to the authors. The authors further speculate that this may be due to checking if the change has introduced a defect or not, while others with better results check whole files and ignore which change caused the de- fect in the first place. The problem addressed by Yang et al. [51] is whether it is possible to detect bugs in code written by developers. Their proposed solution, explored in their paper, involves utilizing DL with a novel approach named Deeper. The study aims to build upon the work previously conducted by [51]. Instead of relying on traditional ma- chine learning methods like logistic regression, Yang et al. employed DL techniques, specifically utilizing the Deep Belief Network algorithm. Their new approach was then compared against the method proposed by Kamei et al. [51]. The results showed a significant improvement, with Yang et al.’s approach detecting 32.22% more bugs on average (51.04% versus 18.82%). Recent advancements in just-in-time defect prediction have seen the adoption of novel algorithms aimed at enhancing prediction accuracy. Among these, the SZZ algorithm, pioneered by Sliwerski et al. [52], stands out as one of the most preva- lent for identifying defect commits. Subsequent studies have explored variations of this original algorithm, such as those introduced by Kim et al. [53] and Da Costa et al. [54]. Yuanrui et al. [55] conducted a comprehensive analysis of these ap- proaches to assess their susceptibility to mislabeling changes as either bug-fixing or non-bug-fixing. Their study revealed that both the Sliwerski et al. and Kim et al. methodologies exhibited high false positive and false negative rates. Specifically, the former recorded false positive rates ranging from 2% to 10% and false negative rates between 13% and 20%. Conversely, the approach by Kim et al. yielded false negative rates of 1% to 10% and false positive rates of 16% to 45%. 
Notably, Da Costa et al. achieved the most promising results, with false positive and false negative rates falling within the range of 1% to 6%.

Most defect prediction studies focus on identifying defects in commits before a CI job runs, often without explicitly mentioning CI. However, they remain valuable because defect prediction can facilitate the resolution of issues before they reach the CI pipeline. This preemptive fixing could potentially render the subject of the thesis redundant. Nonetheless, existing defect-finding algorithms typically only detect around 50% of defects and are susceptible to significant levels of false positives and false negatives.

3.4 Bringing feedback to developers
Given the predictions made by the algorithms, it is also important that this information is displayed to the developer in a way that is easy to understand. Multiple different approaches have been suggested, but all of them have focused on prediction algorithms that run before the CI job has started.

One such implementation is a standalone application with a frontend for developers. Rosen et al. [56] introduced Commit Guru, a web-based tool capable of analyzing commits within a repository. The tool tries to solve the problem of prediction tools not being used; the authors state that one significant reason for the limited adoption is the absence of tools that integrate state-of-the-art analytics and prediction techniques. Commit Guru discerns which commits introduced bugs and which ones remedied them, alongside providing insights like total commit count and metrics indicating the level of risk associated with commits.

One limitation of the standalone web application approach is its lack of integration with git repositories [57]; the same work notes a growing concern among researchers that software analytics needs to be both explainable and actionable. To remedy these shortcomings, Khanan et al. [57] developed JitBot, a just-in-time defect prediction bot. The bot utilizes historical commits as training data to predict defects. Once a pull request is made, JitBot extracts the necessary features from the data and sends its predictions back to the developer. It then comments on the pull request, explaining its predictions and suggesting potential mitigations. The authors illustrate two examples: in one, the bot identifies moderate risk due to the developers' lack of experience and a high number of deleted lines; in another, labelled low risk, the risk stems from a high number of modified directories and frequent changes to the files involved. This thesis aims to address an important concern: the lack of evaluation regarding how developers would utilize such a system and whether they find it beneficial in their workflow. Additionally, the concept of "low experience" used in the paper [57] remains undefined, casting doubt on the effectiveness of their model.

Another approach to providing feedback to developers involves developing an extension to an Integrated Development Environment (IDE), as demonstrated by Kawalerowicz and Madeyski [58]. Their system, Jaskier, operates with a server responsible for training and predictions, along with a database for storage. Whenever a file is saved within the IDE, change statistics are transmitted to the prediction server, which then sends the prediction back to the developer's IDE.
The authors primarily focus on Visual Studio, which presents a graphical user interface indicating the likelihood of a file containing defects. For VS Code, this information is conveyed via the output console, together with the corresponding failure probability as a percentage. This thesis seeks to analyze the different methods above to ascertain developers' preferences for receiving feedback from a prediction system. Moreover, as it also delves into predictions during CI build runtime, a topic largely unexplored in existing feedback approaches, other potential solutions are also explored.

4 Research Design
The previous chapter outlined others' existing work and set out a baseline for the results the thesis aims to achieve. This chapter follows the design science research (DSR) paradigm [59], and more specifically the guidelines for conducting DSR in a master's thesis [60].

• RQ1: To what degree is it possible to provide feedback to the developer based on predicting a CI job build outcome?
  – RQ1.1: To what degree is it possible to predict the result of a CI job just-in-time, based on previous data from the same pipeline?
  – RQ1.2: What is the relationship between the computational power needed to predict the build outcome just-in-time and the size of the job parameters?
  – RQ1.3: How can the feedback on the most probable failure cause from the just-in-time prediction best be displayed to the developer?
  – RQ1.4: What is the perceived usefulness of just-in-time build outcome predictions for software developers?
  – RQ1.5: At what point in the build process is it no longer useful to continue making predictions?
• RQ2: To what extent can the problems be solved by the potential solutions in (RQ1)?

Each main research question defines one cycle; the thesis therefore features two cycles in total. The first cycle explores potential solutions for addressing the identified problems. The second cycle then evaluates how effectively the solutions proposed in the first cycle resolve the issues identified there. The subsequent sections describe the handling of each research question in more detail: Section 4.1 addresses RQ1, while the following section tackles RQ2.

4.1 Solution
In this thesis, the solution cycle was broken down into smaller iterations, each aimed at generating a distinct contribution to the artifact. The findings encompassed various evaluated combinations of classifiers, tokenizers, models, and hyperparameters. Their performance was evaluated in terms of accuracy, MCC, and required computational resources, followed by analysis of the gathered information. The insights gleaned from analyzing the findings fed into the subsequent iteration, thereby strengthening the iterative process.

Each iteration commenced with collaborative planning between the academic and industrial supervisors to outline the tasks to be undertaken. These plans were flexible, serving as rough guidelines rather than rigid directives. Adjustments were made as necessary, particularly based on the outcomes of earlier evaluations within the iteration, which could influence subsequent stages. The second part of each iteration consisted of building a prototype and evaluating different combinations for the artifact.

4.1.1 Machine learning model
The first part of the first cycle involved evaluating which ML models could best be used for CI build outcome prediction while the CI job is running.
With ML there is a plethora of parameters that can affect the result, so it is essential to evaluate different configurations to obtain the best performance. The different combinations comprised various methods for improving the result, such as parameter tuning, data handling, and data balancing. The dataset for this study consisted of logs sourced from Zenseact's CI system, predominantly containing data related to the development of safety system software for automotive applications.

The system developed initially operated by employing a downloader capable of retrieving log files saved upon the completion of each real CI job. The downloaded metadata included the log file itself along with its corresponding status, denoting either a pass or a fail outcome. This data served as input for the models utilized in the prediction system. The process was structured as in the following list and is connected to the flowchart in Figure 4.1; each number in the flowchart corresponds to an item in the list (an illustrative code sketch of this loop is shown further below).

1. One year of job logs is downloaded from the specified job.
2. The downloaded CI job data is loaded; depending on the balancing used, different data is loaded.
3. The data is preprocessed and unnecessary elements, for example timestamps, are removed from the logs if necessary.
4. The preprocessed data is tokenized with the tokenizer chosen for that build.
5. Lastly, in the training phase, the tokenized data is handed to a classifier, which returns the trained classifier.
6. In the first step of simulating a pipeline prediction, test data is loaded in chunks based on the number of lines to predict on for each prediction. This data goes through the same steps as 3 and 4 before being handed to the classifier returned in step 5.
7. The classifier returns 0 if it predicts the job will fail, and 1 if it predicts the job will succeed.
8. The predictions are evaluated against the real outcome of the job.

Figure 4.1: Flowchart of the process described in the list above.

The concluding phase of each iteration involved analyzing the artifact and determining the course of action for the subsequent iteration, a process facilitated by collaboration with both the industrial and academic supervisors. Each new finding served as a means to identify shortcomings and areas for improvement. Subsequently, planning for the next iteration commenced, with the primary aim of addressing the identified shortcomings. Given the time-intensive nature of evaluating each combination, it was imperative to prioritize those combinations likely to yield meaningful data. Consequently, numerous combinations had to be discarded, resulting in only a few combinations being carried forward to the next iteration.

4.1.2 Interface for developers
Numerous methods exist for CI systems to deliver feedback to developers, and to determine the most effective ones according to developer preference, multiple designs and approaches for providing feedback from predictions were compared. Two distinct approaches were employed in selecting which solutions to present to developers. The first involved an examination of the current feedback mechanisms utilized at Zenseact within their CI system. The second approach leveraged prior studies and the methodologies employed therein.
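Returning briefly to the training-and-simulation loop of Section 4.1.1, the following is a minimal, illustrative sketch of how such a loop could look. All helper names are hypothetical, and TF-IDF with a random forest is used purely as a stand-in for the thesis's actual tokenizers and classifiers:

```python
# Illustrative sketch of the loop in Figure 4.1; not the actual Zenseact implementation.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

def preprocess(log_text):
    # step 3: strip elements such as leading timestamps (simplified)
    return "\n".join(line.split(" ", 1)[-1] for line in log_text.splitlines())

def train(logs, outcomes):
    # steps 4-5: tokenize/vectorize the cleaned logs and fit a classifier
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(preprocess(l) for l in logs)
    clf = RandomForestClassifier(random_state=0).fit(X, outcomes)
    return vectorizer, clf

def simulate(vectorizer, clf, running_log, lines_per_step=50):
    # steps 6-7: predict repeatedly on the log as it grows, one chunk at a time
    lines = preprocess(running_log).splitlines()
    predictions = []
    for end in range(lines_per_step, len(lines) + 1, lines_per_step):
        X = vectorizer.transform(["\n".join(lines[:end])])
        predictions.append(int(clf.predict(X)[0]))  # 0 = predicted fail, 1 = pass
    return predictions  # step 8: compare against the job's real outcome

if __name__ == "__main__":
    logs = ["12:00 step one ok\n12:01 tests passed", "12:00 step one ok\n12:01 compile error"]
    outcomes = [1, 0]
    vec, clf = train(logs, outcomes)
    print(simulate(vec, clf, logs[1], lines_per_step=1))
```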
The adoption of these two approaches resulted in the creation of several mockups, available in the interview guide in Appendix A, which were subsequently evaluated through interviews with developers at Zenseact. This process aimed to ascertain whether a consensus exists regarding the preferred approach or whether developers hold differing opinions on the matter.

The selection of interviewees followed purposeful sampling, involving the identification of participants with expertise or experience in software development who were also available and willing to engage, with an exemption for CI developers. Each interview spanned approximately 30 to 50 minutes, during which consent for data collection was obtained and measures to maintain confidentiality were assured. Originally, the interviews were intended to involve exclusively developers working on the primary product, specifically self-driving technology. In the end, however, one scrum master and one product owner were added to gain a broader view, as they knew how the teams operate. All participants and their prior experience can be found in Table 4.1.

Table 4.1: Interview participants, their roles, and prior experience.

Name           Role            Years at Zenseact   Years as CI developer
Interviewee A  Developer       1.5                 No
Interviewee B  Developer       7                   No
Interviewee C  Developer       3.5                 0.5
Interviewee D  Scrum master    3                   2
Interviewee E  Developer       2                   No
Interviewee F  Product owner   7                   0.5

4.2 Evaluation
To evaluate the gathered data, the thesis has employed a combination of qualitative and quantitative evaluation approaches. Research questions RQ1.1 and RQ1.2 have been analysed with the help of computational experiments and a quantitative evaluation approach. To answer RQ1.3, RQ1.4, and RQ1.5, a qualitative approach has been employed, using the interviews as the basis. RQ2 has subsequently been examined using both qualitative and quantitative evaluation approaches.

4.2.1 Computational experiments
During the analysis phase of each iteration of building the machine learning model, hypothesis testing has been used. In some of the iterations, time series analysis has also been used when comparing the results from simulating a CI build. The u