Improving Continuous Integration Feedback Flow
A Design Science Study

Master's thesis in Computer science and engineering

Christian Lind

Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
Gothenburg, Sweden 2024

© Christian Lind, 2024.

Supervisor: Miroslaw Staron, Department of Computer Science and Engineering
Advisor: Patrik Firek, Zenseact
Examiner: Eric Knauss, Department of Computer Science and Engineering

Master's Thesis 2024
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Typeset in LaTeX
Gothenburg, Sweden 2024

Abstract

Continuous integration represents a prevalent practice involving the automated merging of code modifications from various contributors into a unified software project. Despite its widespread adoption, this process often entails considerable time and is susceptible to failures. Consequently, efforts have been directed towards anticipating the outcome of the continuous integration process prior to its initiation. This thesis explores the feasibility of predicting the outcome in near-real-time, leveraging the data accessible within the continuous integration job at that specific moment, employing a design science research approach across three iterative cycles. Utilizing the design science research approach, the thesis initially delved into the issue by gathering data through interviews and a concise literature review. This process resulted in identifying the problem of delivering improved and swifter feedback to developers. The literature review also unearthed prior efforts aimed at addressing the same issue, prompting an exploration into employing machine learning to forecast build outcomes based on continuous integration (CI) job log data. The outcomes of evaluating various algorithms spurred both empirical and qualitative/quantitative analyses, augmented by interviews with developers at Zenseact. The primary contribution lies in the crafted artifact itself, a significant addition to the realm of predicting the outcome of continuous integration job builds, serving as a practical solution validated within an industrial setting. This artifact not only introduces innovative resolutions to recognized challenges but also enriches the repository of design science knowledge.

Keywords: continuous integration, machine learning, just-in-time prediction, design science research

Acknowledgements

I would first and foremost like to thank my academic supervisor, Miroslaw Staron, and my industrial supervisor, Patrik Firek, for helping me during the project with any challenges that occurred. I would also like to thank David Friberg for guidance on legal issues and for making the necessary preparations for the thesis. Lastly, I would like to thank the rest of the Overflow team and all the developers at Zenseact who supported me in any way while working on the thesis.
Christian Lind, Gothenburg, June 2024

Contents

List of Figures
List of Tables
1 Introduction
  1.1 Problem Description
  1.2 Purpose of the study
  1.3 Limitations and Delimitations
  1.4 Significance of the study
  1.5 Thesis outline
2 Background
  2.1 Machine learning
    2.1.1 Data preprocessing
    2.1.2 Tokenization
    2.1.3 Training
    2.1.4 Overfitting and Underfitting
    2.1.5 Supervised learning
    2.1.6 Evaluation metrics
  2.2 Traditional machine learning
    2.2.1 Random forest classifier
    2.2.2 Gradient boosting
  2.3 Deep learning
    2.3.1 Layers
    2.3.2 Activation functions
    2.3.3 Convolution Neural Network
    2.3.4 Long short-term memory
3 Related work
  3.1 Traditional machine learning in CI
  3.2 Deep learning in CI
  3.3 Just-in time defect prediction
  3.4 Bringing feedback to developers
4 Research Design
  4.1 Solution
    4.1.1 Machine learning model
    4.1.2 Interface for developers
  4.2 Evaluation
    4.2.1 Computational experiments
    4.2.2 Interview
5 Artifact
  5.1 Feedback flow
  5.2 Prediction system
6 Findings
  6.1 First iteration
    6.1.1 Result
    6.1.2 Analysis
  6.2 Second iteration
    6.2.1 Result
    6.2.2 Analysis
  6.3 Third iteration
    6.3.1 Result
    6.3.2 Analysis
  6.4 Iteration 4
    6.4.1 Results
    6.4.2 Analysis
  6.5 Iteration 5
    6.5.1 Results
    6.5.2 Analysis
7 Discussion
  7.1 Threats to validity
  7.2 Future work
8 Conclusion
Bibliography
A Interview Template
B Appendix 2

List of Figures

2.1 Example of how the gradient descent algorithm tries to reach zero.
2.2 CNN network.
2.3 Basic RNN structure.
4.1 Flowchart of what is explained in the bullet list.
5.1 Flowchart of feedback loop.
5.2 Flowchart of how the predictor functions.
6.1 Confusion matrix comparison of RF and GB on Each Row model.
6.2 Confusion matrix comparison of RF and GB on the Whole Log model.
6.3 Confusion matrix comparison of RF and GB on the Logtime model.
6.4 Time and line progression for each test sample in job 1.
6.5 Comparison of predicting on each line compared to on only every tenth line. The RF classifier is used together with the char tokenizer.
6.6 Comparison of different tokenizers on Whole Log model with MCC scores.
6.7 Logtime model's MCC score using several types of tokenizers.
6.8 Difference between accuracy when predicting unsuccessful and successful jobs running on the Logtime model.
6.9 Number of incorrect predictions using the RF classifier and char tokenizer, where accuracy is represented by the blue line and the number of incorrect predictions is depicted by the orange line over the number of lines used in prediction.
6.10 Number of jobs that are left after a certain number of lines has been reached, where accuracy is represented by the blue line and the number of jobs remaining is depicted by the orange line over the number of lines used in prediction.
6.11 Impact of keeping and removing the timestamp from the logs running on the Logtime model.
6.12 Time and line progression for each test sample in job 2.
6.13 Number of jobs that are left after a certain number of lines has been reached in job 2. Accuracy is represented by the blue line and the number of jobs remaining is depicted by the orange line over the number of lines used in prediction.
6.14 Time and line progression for each test sample in job 3.
6.15 Number of jobs that are left after a certain number of lines has been reached in job 3. Accuracy is represented by the blue line and the number of jobs remaining is depicted by the orange line over the number of lines used in prediction. The combination used for getting the accuracy metric is RF with the char tokenizer and equal classes.
6.16 How balancing data on equal classes, 70/30 split in favor of good samples and all samples used affects the MCC score of the different classifiers and tokenizers when running on the Logtime model using MCC as metric.
6.17 How balancing data on equal classes, 70%/30% split and all samples used affects the accuracy of the different classifiers and tokenizers when running on the Logtime model using accuracy as metric.
6.18 How balancing data on equal classes and 70%/30% split affects the different classifiers and tokenizers when running on the Logtime model.
6.19 GridSearchCV best hyperparameters versus default hyperparameters for all tokenizers and classifiers on the Logtime model on job 1 using MCC as metric.
6.20 GridSearchCV best and default hyperparameters for all tokenizers and classifiers on the Logtime model on job 1 using accuracy as metric.
6.21 GridSearchCV best hyperparameters versus default hyperparameters for all tokenizers and classifiers on the Lines model using MCC as metric.
6.22 GridSearchCV best versus default hyperparameters for all tokenizers and classifiers on the Lines model on job 1 using accuracy as metric.
6.23 Best hyperparameters for each combination of classifier and tokenizer on job 1, showing results of predictions on the Lines model for all datasets, successful jobs and unsuccessful jobs using accuracy as metric.
6.24 Best parameters for each combination of classifier and tokenizer, showing results of predictions on equal classes for the Logtime model running on job 2 using MCC as the metric.
6.25 Best parameters for each combination of classifier and tokenizer, showing results of predictions on equal classes for the Logtime model running on job 2 using accuracy as the metric.
6.26 Best parameters for each combination of classifier and tokenizer, showing results of predictions on equal classes for the Logtime model running on job 3 using MCC as metric.
6.27 Best parameters for each combination of classifier and tokenizer, showing results of predictions on equal classes for the Logtime model running on job 3 using accuracy as metric.
6.28 Best hyperparameters for each combination of classifier and tokenizer, showing results of predictions on equal classes for the Lines model running on job 2 using the MCC metric.
6.29 Best parameters for each combination of classifier and tokenizer, showing results of predictions on equal classes for the Lines model running on job 2 using the accuracy metric.
6.30 Best parameters for each combination of classifier and tokenizer, showing results of predictions on all datasets for the Lines model running on job 3 with equal classes.
6.31 Best layer and epoch count for each combination of classifier and tokenizer, showing results of predictions for the Logtime and Lines models running on job 1 with equal classes using MCC as measurement.
6.32 Best layer and epoch count for each combination of classifier and tokenizer, showing results of predictions for the Logtime and Lines models running on job 1 with equal classes using accuracy as measurement.
6.33 Best layer and epoch count for each combination of classifier and tokenizer, showing results of predictions for the Logtime and Lines models running on job 2 with equal classes using MCC as measurement.
6.34 Best layer and epoch count for each combination of classifier and tokenizer, showing results of predictions for the Lines model running on job 2 with equal classes.
6.35 Best layer and epoch count for each combination of classifier and tokenizer, showing results of predictions for the Logtime and Lines models running on job 3 with equal classes using MCC as measurement.
6.36 Best layer and epoch count for each combination of classifier and tokenizer, showing results of predictions for the Logtime and Lines models running on job 3 with equal classes.
A.1 Link to CI system inside Gerrit commit comment leading to artifact.
A.2 Showing how likely the prediction system thinks the build is going to fail after a given number of log lines has been printed to the log.
A.3 Showing how likely the prediction system thinks the build is going to fail after a given number of log lines has been printed to the log.
A.4 Link to CI system inside Gerrit commit comment leading to marked text.
A.5 Example of log text that is marked in red as the prediction system thinks this will cause the build to fail.
A.6 Example message sent to the user from a bot with a link leading to an artifact in the CI system.
A.7 Prediction button when no prediction is available.
A.8 Prediction button when a prediction is available.
A.9 Message shown to the user when clicking the predictions button.
A.10 Icon to notify the developer that a prediction for the edited code is available.
B.1 Comparison of different tokenizers on Whole Log model with accuracy.
B.2 Logtime model's accuracy using several types of tokenizers.
B.3 Difference between predicting unsuccessful and successful jobs running on the Whole Log model with different tokenizers.
B.4 Impact on MCC score when keeping and removing the timestamp from the logs running on the Whole Log model.
B.5 Impact on accuracy when keeping and removing the timestamp from the logs running on the Logtime model.
B.6 Impact on accuracy when keeping and removing the timestamp from the logs running on the Whole Log model.

List of Tables

4.1 Performance impact on training with different classifiers, models and tokenizers.
6.1 Combinations evaluated in the first iteration.
6.2 Accuracy metrics when running model Each Row with the LabelEncoder tokenizer. RF metrics to the left and GB to the right.
6.3 Accuracy metrics when running Whole Log model with the LabelEncoder tokenizer. RF metrics to the left and GB to the right.
6.4 Accuracy metrics when running Logtime model with the LabelEncoder tokenizer. RF metrics to the left and GB to the right.
6.5 Combinations evaluated in the second iteration.
6.6 Performance impact on training with different classifiers, models and tokenizers.
6.7 Combinations evaluated in the third iteration.
6.8 Number of good and bad samples in the different jobs.
6.9 DTW distance between default and best hyperparameters as well as the performance impact on training with different classifiers, tokenizers and hyperparameters on the Logtime model.
6.10 DTW distance between best hyperparameters and default as well as the performance impact on training with different classifiers, tokenizers and hyperparameters with the Lines model.
6.11 DTW distance between best and default hyperparameters as well as the performance impact on training with different classifiers, tokenizers and best hyperparameters with the Logtime model on job 2.
6.12 Performance impact on training with different classifiers, tokenizers and hyperparameters with the Logtime model on job 3.
6.13 DTW distance between the best hyperparameters and the default hyperparameters as well as the performance impact on training with different classifiers, tokenizers and hyperparameters with the Lines model on job 2.
6.14 DTW distance between the best hyperparameters and the default hyperparameters as well as the performance impact on training with different classifiers, tokenizers and hyperparameters with the Lines model on job 3.
6.15 Combinations evaluated in the fourth iteration.
6.16 Performance impact on training with different DL classifiers, models, tokenizers and best hyperparameters on job 1.
6.17 Performance impact on training with different DL classifiers, models, tokenizers and best hyperparameters on job 2.
6.18 Performance impact on training with different DL classifiers, models, tokenizers and best hyperparameters on job 3.
A.1 Interview questions.

1 Introduction

As software repositories become bigger, with more and more tests added, the time taken for a continuous integration (CI) job to complete continues to increase. After committing changes to a version control system, developers are often eager to promptly receive feedback from the CI job regarding the success or failure of their commit [1]. When a CI job takes over 30 minutes [2] to complete, the developer loses focus and the workflow is disrupted. Developers also desire concise and explanatory data to promptly identify the specific failure that led to the erroneous return of the CI job [3].

To address the aforementioned challenges, predicting the build outcome in CI has emerged as a prominent area of interest. Previous studies have primarily focused on predicting job outcomes based on commit metadata. However, scant attention has been paid to predicting the job outcome while the CI job is actively running.
During this phase, additional parameters become available which were not considered in earlier research efforts. Consequently, leveraging these parameters during job execution should enhance the accuracy of outcome predictions.

The primary objective of this study is to identify and implement enhancements to the developer workflow, particularly focusing on expediting feedback from the CI system. This study will be conducted in collaboration with the observability team at Zenseact, a company dedicated to advancing fully autonomous vehicles. Collaborating with Zenseact offers the advantage of leveraging their established CI infrastructure, along with access to tens of thousands of saved logs from previous CI job runs.

1.1 Problem Description

A developer will often want a fast response as to why their code has failed to be integrated with the CI system, because they first want to see why the commit in question failed the tests. CI systems today therefore employ techniques that are supposed to make builds faster, such as caching, multithreading and optimizing code as much as possible [4]. What all of these methods have in common is that they can break an already working system. Even with all these techniques, in a large-scale product the CI jobs can still take multiple hours to finish [4], and eagerness for quick feedback may turn into frustration over a slow completion. The pipeline job may also have problems which are not easy to spot for a developer not that invested in how the CI system works [5]. This means that a pipeline job that takes longer than usual might raise warning signs for newcomers while, in reality, it may simply be that a new update needs to be installed.

One problem for developers is that a single pipeline job can take multiple hours to complete, with over 40% of all jobs reportedly taking more than 30 minutes [2]. This can lead to long wait times before commits are finally merged into the repository, as each new commit has to go through a new integration process until all stages in the CI job pass. If there are multiple bugs in the code that the developer has not fixed, this might require multiple commits before all fixes are applied, depending on when the developer finds the bugs and how long it takes to correct them.

As a result of long build times, developers tend to lose focus, and this can hurt productivity and parallel development [6]. Long build times can also lead to more computational resources being needed to complete the build process before a failure is discovered. Waiting for a CI build to finish, or starting another project on the side while waiting for a CI build to finish, can negatively affect the productivity of a developer [7].

To remedy this problem, machine learning (ML) algorithms can be used to predict the outcome of the build [8]. This can help developers predict whether a build will fail or pass and thus know the outcome of the build beforehand. However, this ML approach cannot help in finding where a failure might occur in the code they have uploaded. This can lead to the developer having to spend a lot of time looking through the code to find the failure. The failure might not be clear before running the pipeline job and getting the result, resulting in significant time wasted by the developer.

Another problem for most developers is how verbose a log should be.
When running all checks in a CI job, it is typical to also log every step the CI job takes in order to ensure everything is working accordingly. These logs can for example contain the CMake log from building the program with CMake. While these logs can help the developer find the cause of the build issues, they may not be as clear as one would expect [9]. A more verbose build log may contain unnecessary details, making it hard to read [10]. On the other hand, a minimal log may not contain all the necessary information for solving the problem [11].

In large CI systems with a lot of building and testing, it is common to get back a log with megabytes of data [12]. When a pipeline job fails it will indicate which stage the failure occurred in, but if that stage has a lot of log data, it can still be very time-consuming for the developer to find the failure [13]. It is also frustrating for the developer to get back a large log showing a failure that is hidden behind thousands of lines of log entries. For failed jobs, it can be challenging to quickly find the reason why the underlying job failed [13]. This results in an increased cognitive load on developers. It can also lead to questions about what caused the pipeline to fail [14]. Developers often look for keywords in the logs which might indicate the failure [14], but these may be misleading and thus trick the developer into fixing something that is not broken in the first place or is completely unrelated to the real issue.

A CI DevOps engineer would want to find failures in the CI job itself, while the developer only cares about the issue that is specific to their commit. This can create a scenario where the developer must search through large logs to find the right issues for their build. Zampetti et al. [8] mention that their survey reports disagreement on how large a log should be, later remarking that it is almost impossible to find a scenario where a more minimalistic log is better. Each passed or failed CI job is accompanied by a log in textual form. For a pipeline job that compiles C++ libraries, the log is a C++ build log, and a failed job should contain information about why the build failed. For a normal developer working with the C++ code, it might be hard to fully understand how to search the log for the related issue.

Today, developers often get to choose where it is relevant to log inside the code [10]. This can be problematic, as developers have different domain knowledge, which can lead to not logging properly in places where an error can occur. This can in turn lead to logs that contain no information regarding the real reason why a build failed and thus will be hard to debug [10]. Such a problem can be seen in Listing 1, where Docker has tried to push a new Docker build but failed because of some Python script. Usually an exception is shown as to why the Python script has failed, but this is not the case in this instance. If a log does not contain the information needed, then it will be impossible for a developer as well as a prediction algorithm to correctly find what caused the build to fail. In this case, it is irrelevant how much log data there is and how fast a developer wants to get the information, as the relevant information does not exist.
&1|20:52:35.222 Running: docker push hub.docker.com/car_platform/master/71ce2b95d1d76856eceb84/config:src-2.11
&1|20:52:35.265 !! Error when executing ['/var/lib/agent/pipelines/src/build/a549-4392-19b57c7915cc/venv/bin/python', 'config/jobs/deliver_car_platform.py'], return_code 1
?1|20:52:35.395 [go] Task status: failed (147597 ms) (exit code: 1)

Listing 1: Example of log output with no information on what caused the error.

1.2 Purpose of the study

The purpose of the study is to explore to what extent it is possible to provide better feedback in near-real-time to developers from a CI system. The study will result in a proof of concept that simulates a CI system sending streams of new log lines in near-real-time to a model for analysis. This model then provides near-real-time feedback on whether the build is likely to fail, giving developers the chance to choose between multiple different ways to implement a feedback flow. Thereafter, the feedback flow is evaluated based on how developers perceive its usefulness compared to a normal CI workflow. Additionally, quantitative analysis is conducted to assess the accuracy and the MCC score of the algorithm in predicting build outcomes in near-real-time. These findings are then juxtaposed with developers' expectations regarding prediction accuracy. Furthermore, considerations such as time savings and computational resources are factored in to determine the viability of such a solution. Ultimately, this thesis aims to benefit both developers and researchers by providing insights into optimizing the CI workflow.

1.3 Limitations and Delimitations

The thesis will focus on providing feedback to the developer in near-real-time while the CI job is running. It will therefore not feature any study on providing feedback before or after the job has been run. The feedback will only be provided using already existing ML classifiers, meaning no research on new classifiers or similar. In this thesis, multiple combinations of models and classifiers will be evaluated, but it is impossible to evaluate all of them within this thesis's scope. Classifiers have been chosen based on their performance in other work in a similar area, predicting CI build outcomes or classifying logs. The models have been created and chosen based on how log files are typically preprocessed in a similar scenario.

Another limitation is that only CI logs from one company are used. Multiple jobs are used, but all are provided by a company specializing in autonomous driving for cars. There are more aspects of a car than what Zenseact is building, and therefore the way other companies build their pipelines might look different. This is also true for any other industry where the needs of the company are different.

1.4 Significance of the study

This study contributes both to the academic field and to practitioners. In academia, it explores a novel area by predicting CI job outcomes in near-real-time using log data generated by a CI system, an area previously only examined before a job has been started. It identifies various methods for making such predictions and evaluates their effectiveness, ultimately presenting an artifact that underscores the value of developing such a predictive system.

The contribution to Zenseact is a new prediction system set up and ready for use. Numerous CI jobs from Zenseact have undergone testing, and additional ones can be seamlessly integrated.
While the contribution to Zenseact surpasses that to other companies due to the accessibility of these tools, the knowledge provided in this report facilitates the replication of similar systems by other organizations. Furthermore, companies can fine-tune various parameters to optimize compatibility with their own CI systems.

1.5 Thesis outline

In Chapter 1, a concise overview of the thesis subject was presented, including an exploration of the project's objectives and limitations. Chapter 2 introduces relevant terminology and concepts for this thesis. Chapter 3 presents other studies relevant to this thesis and how their findings will be used. Chapter 4 describes the research methodology used in this study. Chapter 5 describes the developed artifact and its functionality. Chapter 6 presents the results and analysis of all iterations. Chapter 7 argues for the findings and answers the research questions. Chapter 8 draws conclusions for the thesis.

2 Background

This chapter introduces important concepts that are integral to the workflow. It begins by exploring the fundamentals of machine learning (ML) and the evaluation metrics used throughout the thesis; it then discusses traditional machine learning and, later on, deep learning (DL) techniques, elucidating their respective functionalities.

2.1 Machine learning

Machine learning (ML) is a branch of artificial intelligence (AI) that empowers computers to acquire intelligence and perform tasks without explicit programming. ML is used to simulate human learning activities [15] for obtaining new information and skills in order to continuously improve knowledge. ML algorithms can be broadly categorized into several types, each tailored to different learning scenarios. Supervised learning involves learning from labeled data, where the algorithm predicts output based on input-output pairs provided during training. In contrast, unsupervised learning explores unlabeled data to uncover hidden structures and patterns. Semi-supervised learning represents a hybrid approach combining elements of supervised and unsupervised learning, while reinforcement learning enables agents to learn through interactions with an environment.

2.1.1 Data preprocessing

Data encompasses the raw information, whether structured or unstructured, that is fed into a model to enable learning and decision-making. This data can come from various sources, such as databases, sensors, text documents, images, or audio recordings. However, data quality and quantity play crucial roles in determining the model's effectiveness and reliability.

Data preprocessing is often the initial step in preparing the data for training with an ML classifier. This involves tasks like cleaning the data to remove noise and inconsistencies, handling missing values, and transforming the data into a format suitable for the model [16]. Removing noise refers to the process of eliminating irrelevant or unwanted information from the dataset. Noise can manifest in various forms such as outliers, errors, or inconsistencies in the data. Removing noise is crucial because it can adversely affect the performance and accuracy of ML classifiers by introducing unnecessary variability or bias.

Additionally, features may need to be extracted or engineered to enhance the classifier's ability to learn relevant patterns and relationships within the data.
Features represent individual measurable properties or characteristics of the data used as input for an ML model. Features are essential components of the dataset that provide information to the classifier, allowing it to learn patterns and make predictions. In the end, the goal of data preprocessing is to find the best set of features for the classifier used [17]. The two most relevant techniques for this thesis are addressing data imbalance and handling outliers.

Addressing data imbalance

In classification problems, imbalanced datasets occur when one class of the target variable is significantly more prevalent than others [18]. Data cleaning may involve techniques such as resampling (oversampling the minority class or undersampling the majority class) to address data imbalance and improve the performance of the model. Oversampling involves increasing the number of instances in the minority class by randomly replicating them or generating synthetic samples. Undersampling involves reducing the number of instances in the majority class by randomly selecting a subset of instances. This can help balance the class distribution, but it may also result in loss of information. Combining oversampling and undersampling techniques can be beneficial to balance the dataset while minimizing the loss of information.

Handling outliers

Outliers are data points that deviate significantly from the rest of the dataset. Outliers can skew statistical analyses and affect the performance of ML models. Techniques for handling outliers include trimming (removing extreme values), capping (replacing extreme values with a predefined threshold) or transforming the data to be more robust to outliers. When working with log data, not all output is normal text; this can be due to numerous varied factors, whether intentional or caused by some bug in the system. These occurrences can in this case be categorized as what are known as "Error outliers" [19], which are points in a dataset that are not of interest to the population. In this case, the population would be text.

2.1.2 Tokenization

When working with strings, the data must first be tokenized before being input into a classifier. The process of tokenization is essential when machines need to understand and process human language. Tokenization involves breaking down unstructured text, which can be anything from sentences to entire documents, into smaller units called tokens [20]. These tokens could be individual words, subwords, or even characters, depending on the specific requirements of the task. For instance, when tokenizing a sentence, each word might become a token, or the words might be further segmented into characters. After tokenization, the tokens are converted into numerical representations through vocabulary building, which assigns a unique index to each token. To help with understanding how the tokenizers work, each of the different tokenizers will be accompanied by an example taken from Listing 1, "?1|20:52:35.395 [go] Task status: failed (147597 ms) (exit code: 1)", showing how this string is tokenized.

Word tokenization

The word tokenizer scans through the input text character by character, identifying patterns such as spaces, punctuation marks, and special characters that separate words or indicate word boundaries. During this process, it detects word boundaries using criteria like whitespace characters, punctuation marks, and language-specific rules.
Special cases like contractions, hyphenated words, and abbreviations are also handled. Various word tokenizers treat special cases differently, but this thesis will utilize the NLTK Treebank word tokenizer [21]. As it processes the text, the tokenizer generates a sequence of tokens, each representing a single word or a meaningful unit of text such as a special character. As can be seen in Listing 2, spaces are excluded from being tokenized; instead, they only act as separators.

{0: '(', 1: ')', 2: '1', 3: '147597', 4: '1|20:52:35.395', 5: ':', 6: '?', 7: 'Task', 8: '[', 9: ']', 10: 'code', 11: 'exit', 12: 'failed', 13: 'go', 14: 'ms', 15: 'status', 16: ''}

Listing 2: Tokenized mapping of words.

The final output array of tokenizing the input string would then look as shown in Listing 3.

[6, 4, 8, 13, 9, 7, 15, 5, 12, 0, 3, 14, 1, 0, 11, 10, 5, 2, 1]

Listing 3: Tokenized input with the word tokenizer.

Character tokenization

Character tokenization works by iterating through each character in the text, including letters, digits, punctuation marks, and whitespace. For each character encountered, the tokenizer generates a token representing that character. It does not consider word boundaries or spaces; instead, it treats each character as a separate unit. The tokenizer continues this process until it reaches the end of the input text, producing a sequence of character tokens. For the example text provided above, character tokenization would produce the mapping shown in Listing 4.

{0: ' ', 1: '(', 2: ')', 3: '.', 4: '0', 5: '1', 6: '2', 7: '3', 8: '4', 9: '5', 10: '7', 11: '9', 12: ':', 13: '?', 14: 'T', 15: '[', 16: ']', 17: 'a', 18: 'c', 19: 'd', 20: 'e', 21: 'f', 22: 'g', 23: 'i', 24: 'k', 25: 'l', 26: 'm', 27: 'o', 28: 's', 29: 't', 30: 'u', 31: 'x', 32: '|', 33: ''}

Listing 4: Tokenized mapping of characters.

The final output array of tokenizing the input string would then look as shown in Listing 5.

[13, 5, 32, 6, 4, 12, 9, 6, 12, 7, 9, 3, 7, 11, 9, 0, 15, 22, 27, 16, 0, 14, 17, 28, 24, 0, 28, 29, 17, 29, 30, 28, 12, 0, 21, 17, 23, 25, 20, 19, 0, 1, 5, 8, 10, 9, 11, 10, 0, 26, 28, 2, 0, 1, 20, 31, 23, 29, 0, 18, 27, 19, 20, 12, 0, 5, 2]

Listing 5: Tokenized input with the character tokenizer.

Byte-Pair-encoding tokenization

Byte Pair Encoding (BPE) tokenization begins with initializing a vocabulary containing all the characters or symbols present in the corpus. Next, the algorithm iterates over the corpus and identifies the most frequent pair of adjacent characters or character sequences. It merges the most frequent pair into a new symbol and updates the vocabulary accordingly. This process continues for a specified number of iterations or until a certain vocabulary size is reached. As the iterations progress, the algorithm gradually builds a vocabulary of subword units that represent frequently occurring character sequences in the corpus. During tokenization, the input text is segmented into subword units based on the vocabulary learned during training. The tokenizer then replaces rare or out-of-vocabulary words with a combination of subword units that are present in the vocabulary. BPE tokenization is effective for handling rare words, morphologically rich languages, and out-of-vocabulary terms by breaking them down into smaller, more manageable subword units.
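To make the merge loop described above concrete, the following is a minimal Python sketch of BPE vocabulary building; the toy corpus words and the number of merges are illustrative assumptions, not values used in the thesis.

from collections import Counter

def bpe_merges(corpus_words, num_merges):
    # Each word starts as a tuple of single characters.
    words = Counter(tuple(word) for word in corpus_words)
    merges = []
    for _ in range(num_merges):
        # Count every pair of adjacent symbols across the corpus.
        pairs = Counter()
        for symbols, freq in words.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Replace every occurrence of the best pair with one merged symbol.
        merged = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        words = merged
    return merges

# Example: 'f'+'a', then 'fa'+'i', then 'fai'+'l' are merged first.
print(bpe_merges(["failed", "failed", "fail", "exit"], num_merges=3))

With a learned merge list like this, a new word is tokenized by applying the merges in order, which is how the pre-trained example below works at a much larger scale.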
Using the ChatGPT4 tokenizer, which has already been trained on a full corpus, tokenizing the sentence provided above results in the mapping shown in Listing 6.

{30: '?', 16: '1', 91: '|', 508: '20', 25: ':', 4103: '52', 1758: '35', 13: '.', 19498: '395', 510: ' [', 3427: 'go', 60: ']', 5546: ' Task', 2704: ' status', 4745: ' failed', 320: ' (', 10288: '147', 24574: '597', 10030: ' ms', 8: ')', 13966: 'exit', 2082: ' code', 220: ' '}

Listing 6: Tokenized mapping of BPE using ChatGPT4 corpus.

The final output array would then look as shown in Listing 7.

[30, 16, 91, 508, 25, 4103, 25, 1758, 13, 19498, 510, 3427, 60, 5546, 2704, 25, 4745, 320, 10288, 24574, 10030, 8, 320, 13966, 2082, 25, 220, 16, 8]

Listing 7: Tokenized input with the GPT4 tokenizer.

The mapping produced by the GPT4 tokenizer can be compared to one built without an already existing full corpus, which would look as it does in Listing 8. Here the full corpus is only the log data that is provided.

{1: '?1|20:52:35.395', 2: '[go]', 3: 'Task', 4: 'status:', 5: 'failed', 6: '(147597': 1, 'ms)', 7: '(exit', 8: 'code:', 9: '1)'}

Listing 8: Tokenized mapping of BPE using corpus from the provided text.

As the mapping in Listing 8 does not have the same context as ChatGPT4, the tokenizer tries to merge pairs of characters; in this case, most characters appear next to each other only once within a word.

LabelEncoder

The LabelEncoder is a predefined module within the Python package Scikit-learn, accessible through the pip library. It accepts input data in the form of an array and outputs corresponding tokens. Unlike other tokenizers utilized in this thesis, the LabelEncoder offers versatility by effectively serving as a substitute for various tokenization methods. For instance, the character tokenizer's functionality can be replicated by segmenting the log into an array of characters and feeding it into the LabelEncoder. The tokenized mapping of the string would then look like Listing 4, and the output would be as shown in Listing 5. Without first segmenting the log into characters, the tokenizer would instead produce the output shown in Listing 9.

{'?1|20:52:35.395 [go] Task status: failed (147597 ms) (exit code: 1)': 0}

Listing 9: Tokenized mapping of the entire input text.

The final output array of tokenizing the input string would then look as shown in Listing 10.

[0]

Listing 10: Tokenized input with the LabelEncoder.

The reason for this behavior is that the LabelEncoder necessitates manual division of the text into smaller inputs by the user. With proper segmentation, it can perform the same function as other tokenizers. For instance, if the text is divided into two strings within an array beforehand, the resulting output array would resemble the depiction shown in Listing 11.

[0, 1]

Listing 11: Tokenized output from LabelEncoder when two strings are sent in one array as input.

2.1.3 Training

Once the data is preprocessed and tokenized, it is divided into training and test sets. The training dataset is used to teach the classifier to recognize patterns [22] and make predictions based on the input data. The test dataset is employed to fine-tune the classifier's hyperparameters and assess its performance during training, helping to prevent overfitting, where the classifier memorizes the training data but fails to generalize to new data.
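A minimal sketch of this split (assuming scikit-learn; the dummy samples, labels, and 80/20 split are illustrative choices, not the thesis's actual configuration):

from sklearn.model_selection import train_test_split

# X: tokenized log samples, y: job outcome (1 = successful, 0 = unsuccessful).
X = [[i, i + 1, i + 2] for i in range(10)]   # ten dummy tokenized samples
y = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]           # imbalanced outcome labels

# Hold out 20% of the samples as a test set; stratify keeps the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(len(X_train), len(X_test))  # 8 2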
Training an ML classifier involves feeding the preprocessed data into the classifier and iteratively adjusting its parameters to minimize the difference between the predicted outputs and the actual outputs. Other options besides hyperparameter tuning are to change how the preprocessed data is structured. This can be done by removing unwanted features or using several types of tokenization. This optimization process typically involves using algorithms such as gradient descent to update the classifier's parameters based on the computed loss or error.

The algorithm starts at a random point with an initial set of parameters [23]. In the example provided in Figure 2.1, the starting point is at a cost of 10000. At that point, the gradient of the terrain is calculated, which tells the algorithm the direction of the steepest slope. In ML, this gradient represents the direction of the steepest increase in the loss function; in the case of Figure 2.1, the descent is towards 0 from the starting point at 100. A loss function, in the context of ML and optimization algorithms, is a mathematical function that measures the discrepancy between the predicted values of a classifier and the actual ground-truth values. It essentially quantifies how well the model is performing on a particular task. The goal of training a classifier is to minimize this loss function, thereby improving the classifier's ability to make accurate predictions. Once the algorithm has found the direction of the steepest slope, it takes a small step in the opposite direction. This step size is called the learning rate, and it determines how far it is possible to move in each iteration; each step corresponds to a red dot in Figure 2.1. The current position is then updated to the new position that has been reached after taking the step downhill. This process is repeated iteratively [23], recalculating the gradient, taking a step downhill, and updating the position until a point is reached where the gradient is close to zero or until a predefined number of iterations has been reached. In Figure 2.1, the number of iterations was reached before the algorithm could reach zero.

Figure 2.1: Example of how the gradient descent algorithm tries to reach zero.

The training process is iterative and computationally intensive, especially for complex models or large datasets. It often requires significant computational resources, such as powerful CPUs or GPUs, to expedite the training process. Additionally, techniques like parallel processing and distributed computing may be employed to accelerate training and handle large volumes of data efficiently. Having ample data necessitates having access to an abundant amount of memory, since the model remains stored in memory throughout the training process.

Throughout the training process, monitoring and optimization are essential to ensure that the model converges to a satisfactory solution and generalizes well to unseen data. This may involve monitoring performance metrics, adjusting hyperparameters, and incorporating feedback from the test set to fine-tune the model's architecture and parameters.

2.1.4 Overfitting and Underfitting

When the classifier is evaluated on previously unseen data, overfitting and underfitting can become apparent. Overfitting and underfitting are fundamental concepts in ML that relate to how well a classifier learns from training data and generalizes to new, unseen data.
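Before turning to overfitting and underfitting, the gradient descent loop described above can be sketched in a few lines of Python; the quadratic loss, learning rate, and starting point below are illustrative choices and do not correspond to the values in Figure 2.1.

def gradient_descent(start, learning_rate=0.1, iterations=50):
    """Minimize the toy loss f(x) = x**2, whose gradient is 2*x."""
    x = start
    for _ in range(iterations):
        gradient = 2 * x                  # direction of steepest increase of the loss
        x = x - learning_rate * gradient  # take a small step in the opposite direction
    return x

# Starting far from the minimum, the position converges towards 0.
print(gradient_descent(start=100.0))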
Overfitting

Overfitting occurs when an ML classifier learns the training data too well [24], including noise and random fluctuations in the data, to the extent that it negatively impacts the classifier's ability to generalize to new, unseen data. Essentially, the classifier memorizes the training data rather than learning the underlying patterns. As a result, an overfitted classifier performs very well on the training data but poorly on new, unseen data. Signs of overfitting include a high accuracy on the training dataset but a significantly lower accuracy on the validation or test datasets. Overfitting often happens when the classifier is too complex relative to the amount and quality of the training data, or when the classifier is trained for too many iterations. To address overfitting, many different techniques can be used; these are some of them:

Cross-validation: In certain scenarios, the ML classifier used can show very good accuracy on one part of a dataset but may not generalize to other parts of the dataset [25]. Cross-validation involves partitioning the dataset into multiple subsets, known as folds. The classifier is trained on a portion of the data and validated on the remaining folds. This process is repeated multiple times, with each fold serving as both the training and test set in turn. Performance metrics obtained from each iteration are then averaged to provide a more robust estimate of the classifier's performance.

Training with More Data: Increasing the size of the training dataset can help the classifier learn the underlying patterns better and reduce overfitting [24]. If obtaining more data is feasible, it is often one of the most effective ways to mitigate overfitting. This however comes with the tradeoff that more computing power will be needed, as the ML classifier needs to fit the new data.

Underfitting

Underfitting, on the other hand, occurs when an ML classifier is too simple to capture the underlying structure of the data. In this case, the classifier fails to learn the patterns present in the training data and performs poorly on both the training and unseen data [24]. Underfitting can happen for assorted reasons, such as using a classifier that is too simple, not providing enough training data, or insufficient training time. Signs of underfitting include low accuracy on both the training and validation/test datasets. There are many different strategies to mitigate underfitting, and the ones this thesis will cover are:

Increase Classifier Complexity: Underfitting happens when the classifier is too simple to capture the underlying patterns in the data. One way to try and solve this problem is by increasing the complexity of the model by adding more layers (only for neural networks), increasing the number of parameters, or using a more complex algorithm.

Feature Engineering: Feature engineering revolves around the creation, selection, and transformation of features from the original dataset to facilitate better classifier learning and prediction [26]. Effective feature engineering entails extracting relevant information, reducing dimensionality, and encoding domain knowledge into the feature space, thereby enabling classifiers to capture underlying patterns more effectively. One way of extracting features from text is the use of different tokenizers, which transform the text in diverse ways.
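As a minimal illustration of turning raw log text into numeric features (assuming scikit-learn; the example log lines and the character-level analyzer are illustrative assumptions, not the thesis's actual feature pipeline):

from sklearn.feature_extraction.text import CountVectorizer

# Two toy log lines; each becomes a fixed-length vector of character counts.
log_lines = [
    "?1|20:52:35.395 [go] Task status: failed (exit code: 1)",
    "?1|20:53:01.120 [go] Task status: passed (exit code: 0)",
]

# analyzer="char" counts single characters, similar to the character tokenizer.
vectorizer = CountVectorizer(analyzer="char")
features = vectorizer.fit_transform(log_lines)

print(vectorizer.get_feature_names_out())  # the learned character vocabulary
print(features.toarray())                  # one row of counts per log line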
Hyperparameter Tuning: Hyperparameters are parameters that govern the behavior and performance of ML algorithms, distinct from classifier parameters that are learned from the training data [27]. Hyperparameter tuning involves systematically exploring and selecting optimal values for these parameters to enhance classifier performance and mitigate issues such as overfitting and underfitting. In order to more easily find which parameters are most suited for the use case, several algorithms can be used to find the best fit. The two most popular algorithms are RandomizedSearchCV and GridSearchCV.

GridSearchCV embodies the concept of hyperparameter tuning by exploring a specified grid of hyperparameter values for a given ML algorithm. It systematically evaluates the performance of the classifier for all combinations of hyperparameters using cross-validation. On the other hand, RandomizedSearchCV operates on the premise of hyperparameter optimization by randomly sampling hyperparameter values from specified distributions or ranges. It systematically evaluates the performance of the classifier for each sampled configuration using cross-validation. The performance difference between them can be quite large, while the difference in accuracy is not that big [27]. The search method of choice will therefore be RandomizedSearchCV when evaluating different jobs and how the parameters affect the result. GridSearchCV will, however, be used in order to get a baseline for the deviation between the two search methods for this thesis.

2.1.5 Supervised learning

Supervised learning is a type of ML paradigm where the classifier learns from labeled data, meaning each input in the dataset is associated with a corresponding output or target variable [28]. The goal of supervised learning is to learn a mapping from input variables to output variables based on the labeled training data provided. This is shown by the mathematical function in Equation 2.1,

f : x → y (2.1)

where the data input to and output from the function in Equation 2.1 is formatted as shown in Equation 2.2.

{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} (2.2)

In supervised learning, the algorithm learns from examples, where it is presented with input-output pairs and adjusts its internal parameters to minimize the difference between the predicted output and the actual output. In Equation 2.2, the input is x and the corresponding output is y. For CI classification, y is the status of the finished job, 1 for a successful job and 0 for an unsuccessful job. Parameter x is the input to the classifier; in the case of the CI job this can be the log data, files edited, commit author, etc.

In classification tasks, the output variable is a categorical value or class label. The goal is to classify input data into predefined categories or classes. Examples of classification tasks include spam detection in emails, sentiment analysis in text data, and image classification.

2.1.6 Evaluation metrics

Researchers usually resort to using commonly accepted performance metrics while evaluating the classifier [29]. Some common ones are accuracy, area under the curve (AUC), area under the ROC (receiver operating characteristic) curve, and F-measure. Although these will be used in the thesis for evaluating the work of other studies, Matthews Correlation Coefficient (MCC) and Dynamic Time Warping (DTW) are used for evaluation of the results from this thesis.
The F1 score was not used in the thesis as it focuses solely on the positive class and does not take into account the true negatives. This can be a limitation in scenarios where the performance on the negative class is also important, as in this thesis, where the purpose is to be able to successfully predict CI builds that are about to fail. ROC was not used as it tells how well the model ranks positive instances relative to negative ones, but not about the actual decision boundaries. It is also not particularly useful for imbalanced classes, hence the decision to use MCC instead, which is also why AUC was not selected.

Dynamic Time Warping

Dynamic Time Warping (DTW) is a technique used in the field of time series analysis to measure the similarity between two sequences that may vary in time or speed. It is particularly useful when comparing sequences that may have temporal distortions, such as varying speeds, shifts, or nonlinear distortions. DTW finds an optimal alignment between the two sequences by warping the time axis, allowing for the comparison of corresponding points in the sequences, even if they occur at different times.

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is a metric used to evaluate the performance of classification algorithms, particularly in the context of binary classification tasks. It takes into account true positives, true negatives, false positives, and false negatives to provide a balanced measure of a classifier's performance, even in cases of class imbalance. MCC ranges from -1 to +1, and the score can be interpreted as follows:

• A score of +1 indicates perfect prediction, where there are no false positives or false negatives.
• A score of -1 indicates perfect disagreement between prediction and observation.
• A score of 0 indicates random prediction.

2.2 Traditional machine learning

Traditional machine learning encompasses a collection of algorithms and methodologies employed to construct models capable of discerning patterns and generating predictions from data. These classifiers possess the ability to autonomously learn from data and make predictions or decisions without the need for explicit programming tailored to these tasks [30]. This section provides an overview of the inner workings of two classifiers, Random Forest (RF) and Gradient Boosting (GB).

2.2.1 Random forest classifier

Random Forest (RF) operates by constructing an ensemble of decision trees during the training phase. Each decision tree is built on a random subset of the training data, employing bootstrapping (bagging) techniques to introduce diversity among the trees [31]. Moreover, at each node of the decision tree, a random subset of features is considered for splitting, enhancing the robustness of the ensemble.

A decision tree is built by asking questions about distinctive features. At the start, all data is in one big box, which is the root node. Then questions are asked to split the data into smaller groups based on the answers. These questions are chosen to maximize the information gain or minimize impurity at each step. As more questions are asked and data is split, branches are created that lead to more specific groups of data. Eventually, the tree ends up with smaller boxes, or leaf nodes, where no more questions are asked because the stopping point has been reached. In classification, each leaf node represents a class, and in regression, it represents a predicted value.
When a prediction is made for a new data point, the algorithm starts at the root node and follows the branches based on the answers to the questions until a leaf node is reached. In classification, the majority class in that leaf node is the prediction; in regression, it is the average of the target values.

Decision trees are simple and interpretable: one can easily visualize the decision-making process and understand which features are most important for making predictions. However, decision trees can also be prone to overfitting, especially if they grow too deep and capture noise in the data instead of true patterns. To mitigate this overfitting, multiple trees are grouped into a forest.

During the prediction phase, the ensemble of decision trees collectively contributes to the final output. For classification tasks, RF employs a majority voting mechanism, while for regression tasks it averages the outputs of the individual trees. This aggregation strategy ensures robust and accurate predictions across diverse datasets.

The strengths of RF lie in its ability to mitigate overfitting [31], handle high-dimensional datasets, and provide insights into feature importance. However, it may exhibit computational overhead, particularly with large datasets, and may not perform optimally in the presence of noise or outliers. Additionally, the interpretability of RF classifiers may vary depending on the context and complexity of the data.

As noted by Nasir et al. [32], the RF algorithm excels in both classification and regression tasks, making it an excellent choice for analyzing log data, especially textual logs, in this thesis. Its appeal lies in its capability to handle high-dimensional data efficiently while delivering strong performance. Moreover, RF is adept at managing imbalanced datasets, a common scenario in CI job logs, where failed logs contain information not typically present in successful CI job runs.

2.2.2 Gradient boosting
Gradient Boosting (GB) is an ML technique built on the cooperation of several classifiers [33]. GB starts with a basic understanding of the problem, in the form of a simple initial classifier. This initial classifier might not be very accurate, but it provides a starting point. GB then looks at where it is making mistakes.

Next, it uses another simple classifier, such as a decision tree, that is good at correcting the mistakes made by the initial classifier. This new classifier focuses on the areas where the first classifier went wrong and helps improve the predictions [33]. The ensemble is adjusted based on the feedback from this new model, meaning it gradually gets better at solving the problem. The process is then repeated, bringing in more classifiers one by one, each focusing on different aspects of the problem and refining the predictions further. With each iteration, the errors are reduced and the ensemble gets closer to the correct solution.

However, GB tries not to rely too heavily on any single classifier, since doing so could lead to worse predictions. It therefore uses a concept called the "learning rate", which controls how much weight each classifier's prediction is given [34]. This helps prevent over-reliance on any single classifier and keeps the predictions balanced and accurate. The process is repeated until GB is satisfied with the predictions or has reached a predetermined level of accuracy.
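A minimal sketch of this idea, again on synthetic data rather than the thesis's CI logs, shows how the number of sequential trees and the learning rate appear as explicit parameters:

```python
# Minimal sketch (assumed toy data) of gradient boosting: many shallow trees added
# one by one, each correcting the errors of the ensemble built so far, with the
# learning rate damping how strongly each new tree's correction is weighted.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gb = GradientBoostingClassifier(
    n_estimators=200,    # number of sequential correcting trees
    learning_rate=0.05,  # weight given to each new tree's correction
    max_depth=3,         # each individual tree stays simple
    random_state=0,
)
gb.fit(X_train, y_train)
print(gb.score(X_test, y_test))
```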
Finally, all predictions are combined, weighted according to their prediction quality, to arrive at the final prediction.

2.3 Deep learning
Deep learning (DL) is a subset of ML that focuses on using neural networks with multiple layers to learn representations of data [35]. Unlike traditional ML classifiers, which may require manual feature extraction, DL classifiers automatically learn hierarchical representations of data directly from the raw input.

At the core of DL are neural networks, computational models inspired by the structure and function of the human brain rather than the traditional zeroes and ones used in conventional computing tasks [36]. These networks consist of interconnected layers of nodes (neurons), where each node performs a simple mathematical operation on its inputs and passes the result to the next layer. Neural networks typically consist of an input layer, one or more hidden layers, and an output layer [36]. Each layer is composed of multiple nodes, and the connections between nodes have associated weights that are adjusted during the training process.

2.3.1 Layers
The input layer is the initial layer of a neural network; it receives input data and passes it to the subsequent layers for processing. It consists of neurons that represent the input features or dimensions of the data, so the number of neurons in the input layer is determined by the dimensionality of the input data. For example, when working with images where each pixel is a feature, the number of neurons in the input layer equals the total number of pixels in the image. The input layer does not perform any computation itself; its primary function is to pass the input data on to the subsequent layers, typically hidden layers, and eventually to the output layer.

A hidden layer is a layer in a neural network that sits between the input layer and the output layer. The term "hidden" implies that these layers are not directly observable from the input or output data; they are intermediary layers where the neural network learns to represent features or patterns in the input data.

The output layer is the final layer of a neural network architecture. It is responsible for producing the desired outputs or predictions based on the input data and the learned parameters of the network. The structure and configuration of the output layer depend on the type of task the neural network is designed to solve.

During training, the dataset is divided into batches of a predefined size, each containing a subset of the training data. DL processes data in batches, and the classifier learns to adjust the weights of the connections between nodes in order to minimize the difference between the predicted output and the actual output for a given input. This process, known as backpropagation [37], involves iteratively adjusting the weights using optimization algorithms such as stochastic gradient descent. Once all the data has been iterated through, one epoch is completed. After completing one epoch, the training process may continue with additional epochs if needed to further refine the classifier's performance. Each epoch provides an opportunity for the classifier to learn from the entire dataset and gradually improve its performance.

The depth of a DL classifier refers to the number of hidden layers in the neural network.
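A minimal Keras sketch of this layer structure (here with two hidden layers, i.e. a depth of two) and of batch/epoch training follows; the framework, feature count, and random data are illustrative assumptions, not the thesis's actual setup:

```python
# Minimal sketch (random data, assumed framework) of input, hidden and output layers
# and of training in batches over several epochs.
import numpy as np
from tensorflow import keras

X = np.random.rand(1000, 50)             # 1000 samples, 50 input features
y = np.random.randint(0, 2, size=1000)   # binary pass/fail labels

model = keras.Sequential([
    keras.Input(shape=(50,)),                      # input layer: one neuron per feature
    keras.layers.Dense(32, activation="relu"),     # first hidden layer
    keras.layers.Dense(16, activation="relu"),     # second hidden layer
    keras.layers.Dense(1, activation="sigmoid"),   # output layer: probability of success
])
model.compile(optimizer="sgd", loss="binary_crossentropy", metrics=["accuracy"])

# batch_size controls how many samples are processed before the weights are updated
# via backpropagation; one pass over all batches completes one epoch.
model.fit(X, y, batch_size=32, epochs=5)
```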
Deeper networks have the capability to learn more complex data represen- tations but also require more computational resources and can be more challenging to train. 2.3.2 Activation functions An activation function is a mathematical operation applied to the output of each neuron in a neural network. It serves as a gate, determining whether the neuron should be activated or not, based on whether the neuron’s input meets a certain threshold. Activation functions introduce non-linearity, enabling the network to learn complex patterns and relationships within the text data. Fully connected lay- ers follow the convolutional layers, allowing the neural network to perform higher- level reasoning and decision-making based on the extracted features. These layers connect every neuron from one layer to every neuron in the next, facilitating a com- prehensive analysis of the text. ReLU The ReLU activation function has previously been shown to improve DL neural 19 2. Background networks [38]. The function works by outputting the max value of a function or 0. The ReLU function is preferred over other activation functions like sigmoid and tanh because it helps mitigate the vanishing gradient problem [39], which can occur during the training of deep neural networks. Additionally, ReLU tends to be com- putationally more efficient compared to some other activation functions. Sigmoid The Sigmoid function maps any real-valued number to a value between 0 and 1 [40]. It has an S-shaped curve, with an output close to 0 for large negative inputs, close to 1 for large positive inputs, and approximately 0.5 at input value = 0. This prop- erty makes it useful for tasks when trying to model probabilities or create binary classifications, as it can squash the output of a linear function into the range [0, 1], thus providing a probability-like interpretation. Sigmoid functions are less favored now due to issues like vanishing gradients, where the gradient becomes extremely small for very large or very small inputs, hindering effective learning during back- propagation. Tanh The Tanh function takes any real number as input and outputs a value between -1 and 1. It is essentially a rescaled version of the Sigmoid function, stretching from -1 to 1 instead of 0 to 1. Tanh has many of the same characteristics as the Sigmoid function in that it has an S-shape and that the function asymptotically approaches -1 as the input value approaches negative infinity and approaches 1 as the input value approaches positive infinity. In neural networks, the Tanh function is often used as an activation function in hidden layers because it is zero-centered and tends to make learning easier compared to the Sigmoid function, which suffers from the vanishing gradient problem. The zero-centeredness helps in mitigating issues related to vanishing gradients, making optimization more efficient. 2.3.3 Convolution Neural Network Convolutional Neural Networks (CNNs) are a specialized type of deep neural net- work architecture commonly applied in text-based classification tasks [41]. Unlike the traditional use of CNNs in image processing, CNNs for text analysis operate over one-dimensional sequences of words or characters instead of two dimensions as is the case with image processing. In text based CNNs, the convolutional layers apply filters of varying sizes [42] to the input text, capturing local patterns and features. Each filter will try to capture dif- ferent patterns in the input text to get a clearer picture when moving to subsequent layers. 
The output of applying a filter to the input data is called a feature map. Each filter produces one feature map, and the number of filters in a convolutional layer determines the depth of the output volume. The depth corresponds to the number of filters or feature maps. Convolutional layers are employed to extract local features or patterns from text 20 2. Background sequences that it trains the classifier on. One-way filters are used on text to show how a sentence is structured; this can be done by checking what tokens (tokenized sentence) are often beside each other. With the sequence from Listing 3 the convo- lutional neural network would learn that there is a pattern of a non-predetermined number of words and then a space in between. This is what is typically called a word in normal English. The tokens from listing 3 inputted into a CNN can be seen in Figure 2.2. In order to learn the pattern an operation called convolution operation is involved. It involves sliding a small filter (also known as a kernel or feature detector) over the input data and computing the dot product between the filter and the overlapping region of the input. In Figure 2.2 this is shown as the input layer to the bottom of the figure converges to the middle layer in the figure. In this case the kernel size of 2 means two values from the bottom layer converges to one value in the middle layer, as is shown by the first two values 5 and 8 both having lines drawn to the leftmost box in the middle layer. The result of this dot product is a single value in the output, referred to as the "feature map". The value for sliding can be controlled by setting the stride parameter that de- termines the step size at which the filter moves across the input data. A stride of 1 means the filter moves one token at a time, while a stride of 2 means it moves two tokens at a time. Larger strides result in smaller output volumes. In the case of Figure 2.2 a stride value of 1 means that the leftmost numbers 5 and 8 is computed first, 8 and 6 are thereafter computed. With a stride of 2 the values 6 and 11 would instead be computed after 5 and 8. After computing the dot product between the filter and the input, a non-linear activation function is typically applied element-wise to the resulting values in the feature map. After convolution, pooling layers are often used to reduce the spatial dimensions of the feature maps while retaining important information. One such pooling tech- nique is Max pooling and divides the input to it into a grid of non-overlapping regions and for each region takes the maximum value. This helps ensure the net- work is more computationally efficient and helps reduce overfitting. In Figure 2.2 the last step in the figure is pooling where the three inputs from the middle layer become one. With max pooling the largest of the three inputs would be the result, while the other two inputs would be discarded. Figure 2.2: CNN network. CNNs for text classification are trained using backpropagation, adjusting the weights 21 2. Background of the network to minimize the difference between predicted and actual class labels. Regularization techniques such as dropout and weight decay are often employed to prevent overfitting during training. Transfer learning can also be leveraged in text based CNNs, where pre-trained models trained on large text corpora are fine-tuned for specific classification tasks with smaller datasets. 
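Before turning to recurrent architectures, the components described above (embedding, filters with a given kernel size and stride, max pooling, and a fully connected output) can be pulled together in a minimal sketch. The token IDs, vocabulary size, and layer sizes are hypothetical and do not reflect the thesis's tokenizer or model:

```python
# Minimal sketch (hypothetical tokenized logs) of a 1-D convolutional text classifier.
import numpy as np
from tensorflow import keras

vocab_size, seq_len = 1000, 100
X = np.random.randint(0, vocab_size, size=(256, seq_len))  # tokenized log lines
y = np.random.randint(0, 2, size=256)                      # pass/fail labels

model = keras.Sequential([
    keras.Input(shape=(seq_len,)),
    keras.layers.Embedding(vocab_size, 32),                   # token id -> vector
    keras.layers.Conv1D(filters=64, kernel_size=2, strides=1,  # 64 feature maps over
                        activation="relu"),                    # windows of 2 tokens
    keras.layers.GlobalMaxPooling1D(),                        # max pooling over time
    keras.layers.Dense(1, activation="sigmoid"),               # classification head
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=3, batch_size=32)
```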
As noted above, such transfer learning enables knowledge learned by the pre-trained models to be reused, enhancing both performance and efficiency.

2.3.4 Long short-term memory
Long Short-Term Memory (LSTM) networks are a type of recurrent neural network (RNN) architecture. An RNN is a neural network architecture tailored to processing sequential data. The network maintains an internal state, or memory, enabling it to capture dependencies between elements within a sequence; how this is done can be seen in Figure 2.3. One problem with RNNs is the vanishing gradient problem [43], which occurs when the gradients used to update the weights of the network during training become extremely small as they are propagated backward through the network layers.

Figure 2.3: Basic RNN structure.

LSTM is specifically designed to address the issues of capturing long-term dependencies and avoiding the vanishing gradient problem that often arises in traditional RNNs [44]. At the core of an LSTM unit is the concept of a "cell state", which acts as the memory of the network. This cell state can maintain information over long sequences, enabling the network to retain context and dependencies over time. LSTM units have three main components: the input gate, the forget gate, and the output gate [44]. Each gate is essentially a set of mathematical operations driven by neural network layers, particularly sigmoid and tanh layers.

At each time step, the LSTM decides what information to discard from the cell state using the forget gate. This gate considers the previous cell state and the current input and outputs a value between 0 and 1 for each component of the cell state, where 0 means "completely forget" and 1 means "completely remember". Simultaneously, the input gate determines what new information to store in the cell state. It involves two stages: a sigmoid layer that decides which values to update, and a tanh layer that creates a vector of new candidate values that could be added to the cell state. The forget gate and the input gate work together to update the cell state: the forget gate decides what information should be removed or retained, while the input gate decides what new information should be added. Finally, the output gate controls what information from the cell state should be output at the current time step. This gate filters the cell state through a sigmoid layer and produces the final output of the LSTM unit.

3 Related work
Several prior studies have explored the feasibility of anticipating the outcome of a CI job before its execution. These studies consider various parameters, including modified files, edited lines of code, commit messages, and more, to predict the outcome. Beyond prediction of CI jobs, there is research on other types of log prediction that could potentially be used to predict CI jobs as well. Moreover, others have investigated how predictions can be used not only in CI but also in other areas related to improving developers' workflow.

3.1 Traditional machine learning in CI
Saidani et al. suggest employing an evolutionary search algorithm [45], particularly the Non-dominated Sorting Genetic Algorithm (NSGA-II), in their research on cutting CI job build time. Furthermore, they wanted to provide developers with support tools for setting up rules characteristic of their CI system.
This algorithm operates on predefined rules that developers manually add for the various pipelines. These rules may encompass factors such as the number of changed files, the magnitude of changes, and so forth. Their best result was an AUC score of 0.78 for correctly labeling data. Although the algorithm shows promise, this thesis focuses on making predictions from CI logs while the CI job is building, so it would not be a feasible algorithm to use here.

Xia et al. [46] examined whether it is possible to predict if a CI build will succeed, given the results of 126 different projects containing almost 300,000 previous builds. They employed nine distinct classifiers, including decision tree, GB, logistic regression, multilayer perceptron, multinomial Naive Bayes, nearest centroid, nearest neighbors, RF, and ridge regression, to find the most effective model. However, the findings were not encouraging: none of the classifiers achieved an F1 score exceeding 0.6, which the authors regard as subpar performance. The AUC score fluctuated more, with scores as high as 0.8 and as low as 0.5. Notably, RF achieved the highest F1 score among the classifiers tested in 40% of all projects. In terms of AUC, GB showed the most promise, leading in 75% of all projects assessed by the authors. Although the results were not promising, as noted by Xia et al., the classifiers were applied before any CI job had started; the results could therefore improve if the approach were used while the CI job is running.

Several other studies have also looked at other ways traditional ML can be used to help with code integration. One such study by E. Knauss et al. [47] looked at how to minimize CI build time instead of predicting the build outcome. They researched how ML can be used to select which test cases to run in a CI test suite, with the goal of reducing the time tests take in the CI build process. The method used a statistical model that connects changed fragments of code, also called code churns, with test results. The results show that the approach could in the best case provide a 3.71-times speedup compared to running all test cases; in the worst case, a 1.1-times speedup was achieved. This was in the context of a single company, so the effect could vary between companies, but it still shows that there are other ways of improving feedback time to developers without CI build outcome prediction.

Another way to find failures before a CI build even starts is through code reviews. M. Staron et al. [48] identify several problems with code reviews. Firstly, humans tend to focus on the wrong things when reviewing code, which they argue can be addressed by filtering out commits that do not require manual review. Secondly, reviews can slow down the integration process when long discussions lead to breaks that disrupt the workflow. They also note that configuration files are often part of the review but can be hard for humans to read. One possible way to address these issues is to automate the code review workflow, although full automation is not desired as it can, for example, hinder knowledge transfer [48]. To determine which review comments are actually needed, the researchers classified comments as positive or negative, with correct classification in 72% of cases and incorrect classification in 28%.
These two papers show that not only prediction of the CI build outcome can help the integration process, but also other measures such as reducing CI build time through test case selection and making code reviews easier.

3.2 Deep learning in CI
Numerous studies have explored the application of DL to predicting CI build outcome. Saidani et al. [49] conducted a study to evaluate the efficacy of deep learning using the same Travis CI dataset employed by Xia et al. [46]. Unlike previous research that employed various DL algorithms, this study focused solely on the LSTM classifier, complemented by a genetic algorithm for hyperparameter tuning. The findings revealed a more promising outcome, achieving an AUC score of 80% in certain scenarios, though results varied significantly, sometimes reaching as low as 50%.

The authors highlighted the considerable computational expense of the algorithm but also acknowledged the potential for further optimization. The outcomes exhibited significant variation when applied to the Travis CI dataset; however, given its favorable findings, leveraging LSTM for near-real-time log analysis could prove advantageous.

DL classifiers are often trained on the structure of successful CI job logs compared to failed job logs. Failed CI job logs frequently contain anomalies when compared to successful CI jobs, rendering algorithms proficient in anomaly detection suitable for CI job log classification. Siyang et al. [50] explored three distinct DL classifiers for anomaly detection within a Hadoop Distributed File System housing extensive logs exceeding 1 GB. The evaluation showed a Convolutional Neural Network and a Multilayer Perceptron both achieving an impressive accuracy, F1 score, and AUC of 99%, whereas LSTM scored 95%.

While this thesis diverges from that study's focus, it is noteworthy that CNN significantly outperformed LSTM. The authors emphasize CNN's suitability for classification tasks like the one in the study, as it addresses overfitting by capturing local semantic information rather than global data. Two key reasons underpin CNN's superiority compared to MLP for CI job runs.

Contextual learning in CNN: CNN models incorporate both horizontal embedding codes and longitudinal entries in logs through 2-dimensional convolution operations. This means that CNNs can learn the correlations between various log entries more effectively than MLPs. MLPs, on the other hand, have a simpler and faster training procedure but do not utilize contextual information as effectively as CNNs.

Relationships in log context with CNN: The CNN classifier can extract more relationships within the log context by leveraging multiple filters. Each line of parsed data belongs to a unique identifier group and is structured in a sorted order based on identifiers related to log events. This structured data presentation allows CNNs to extract meaningful relationships, such as the sequence of log events (e.g., "Job start" before "Job is running"). The CNN's convolutional layers can effectively extract these related features, leading to improved performance compared to MLPs.

3.3 Just-in time defect prediction
Kamei et al. [51] discuss three major problems with CI build outcome predictions. The first issue arises when predictions are made at a high level, leaving developers uncertain about the causes of the predicted outcomes.
Secondly, if a particular file is predicted to contain a bug, it is essential to contact an expert responsible for that file. However, in some cases, more than 100 developers may be working on the same file, making it hard to know who to contact. Lastly, predictions are often made too late in the cycle, as developers aim to detect bugs as quickly as possible. In contrast to the approaches prevalent at the time of publishing, Kamei et al. [51] 27 3. Related work propose a new approach called just-in-time, which aims to prompt action at an ear- lier stage than previous methods. When employing the just-in-time paradigm, it can be triggered immediately upon a commit being made to a repository. Just-in-time refers to a principle or methodology that emphasizes delivering necessary resources or actions precisely when they are needed during the development process. By using just-in-time, the authors believe they address the three main problems. Firstly, the first issue is resolved by searching for each new detail in the commit instead of at the package/file level, resulting in smaller changes for developers to analyze. Sec- ondly, each new commit has only one person attached to it, meaning that person is responsible for fixing it. Lastly, immediate feedback is provided as each commit is checked upon being committed to a repository. To test the new approach, Kamei et al. [51] used a logistic regression algorithm where predictions above 0.5 were classified as defects, while others were classified as non-defects. The results of the study showed that only 20% of all bugs were found, leading to a poor result according to the authors. The authors further speculate that this may be due to checking if the change has introduced a defect or not, while others with better results check whole files and ignore which change caused the de- fect in the first place. The problem addressed by Yang et al. [51] is whether it is possible to detect bugs in code written by developers. Their proposed solution, explored in their paper, involves utilizing DL with a novel approach named Deeper. The study aims to build upon the work previously conducted by [51]. Instead of relying on traditional ma- chine learning methods like logistic regression, Yang et al. employed DL techniques, specifically utilizing the Deep Belief Network algorithm. Their new approach was then compared against the method proposed by Kamei et al. [51]. The results showed a significant improvement, with Yang et al.’s approach detecting 32.22% more bugs on average (51.04% versus 18.82%). Recent advancements in just-in-time defect prediction have seen the adoption of novel algorithms aimed at enhancing prediction accuracy. Among these, the SZZ algorithm, pioneered by Sliwerski et al. [52], stands out as one of the most preva- lent for identifying defect commits. Subsequent studies have explored variations of this original algorithm, such as those introduced by Kim et al. [53] and Da Costa et al. [54]. Yuanrui et al. [55] conducted a comprehensive analysis of these ap- proaches to assess their susceptibility to mislabeling changes as either bug-fixing or non-bug-fixing. Their study revealed that both the Sliwerski et al. and Kim et al. methodologies exhibited high false positive and false negative rates. Specifically, the former recorded false positive rates ranging from 2% to 10% and false negative rates between 13% and 20%. Conversely, the approach by Kim et al. yielded false negative rates of 1% to 10% and false positive rates of 16% to 45%. 
Notably, Da Costa et al. achieved the most promising results, with false positive and false negative rates falling within the range of 1% to 6%.

Most defect prediction studies focus on identifying defects in commits before a CI job runs, often without explicitly mentioning CI. However, they remain valuable because defect prediction can facilitate the resolution of issues before they reach the CI pipeline. This preemptive fixing could potentially render the subject of the thesis redundant. Nonetheless, existing defect-finding algorithms typically only detect around 50% of defects and are susceptible to significant levels of false positives and false negatives.

3.4 Bringing feedback to developers
Given the predictions made by the algorithms, it is also important that this information is displayed to the developer in a way that is easy to understand. Multiple different approaches have been suggested, but all of them have focused on prediction algorithms that run before the CI job has started.

One such implementation is a standalone application with a frontend for developers. Rosen et al. [56] introduced Commit Guru, a web-based tool capable of analyzing commits within a repository. The tool tries to solve the problem of prediction tools not being used; the authors state that one significant reason for the limited adoption is the absence of tools that integrate state-of-the-art analytics and prediction techniques. Commit Guru discerns which commits introduced bugs and which ones remedied them, alongside providing insights like total commit count and metrics indicating the level of risk associated with commits.

One limitation of the standalone web application approach is its lack of integration with git repositories [57]; the same work notes a growing concern among researchers that software analytics needs to be both explainable and actionable. To remedy these shortcomings, Khanan et al. [57] developed JitBot, a just-in-time defect prediction bot. The bot utilizes historical commits as training data to predict defects. Once a pull request is made, JitBot extracts the necessary features from the data and sends its predictions back to the developer. It then comments on the pull request, explaining its predictions and suggesting potential mitigations. The authors illustrate two examples: in one, the bot identifies moderate risk due to the developers' lack of experience and a high number of deleted lines; in another, labelled low risk, the risk stems from a high number of modified directories and frequent changes to the files involved. This thesis aims to address an important concern: the lack of evaluation regarding how developers would utilize such a system and whether they find it beneficial in their workflow. Additionally, the concept of "low experience" used in the paper [57] remains undefined, casting doubt on the effectiveness of their model.

Another approach to providing feedback to developers involves developing an extension to an Integrated Development Environment (IDE), as demonstrated by Kawalerowicz and Madeyski [58]. Their system, Jaskier, operates with a server responsible for training and predictions, along with a database for storage. Whenever a file is saved within the IDE, change statistics are transmitted to the prediction server, which then sends the prediction back to the developer's IDE.
The authors primarily focus on Visual Studio, which presents a graphical user interface indicating the likelihood of a file containing defects. For VS Code, this information is conveyed via the output console, together with the corresponding failure probability as a percentage. This thesis seeks to analyze the different methods above to ascertain developers' preferences for receiving feedback from a prediction system. Moreover, as it also delves into predictions during CI build runtime, a topic largely unexplored in existing feedback approaches, other potential solutions are also explored.

4 Research Design
The previous chapter outlined others' existing work and set out a baseline for the results the thesis aims to achieve. This chapter follows the design science research (DSR) paradigm [59], and more specifically the guidelines for conducting DSR in a master's thesis [60].

• RQ1: To what degree is it possible to provide feedback to the developer based on predicting a CI job build outcome?
  – RQ1.1: To what degree is it possible to predict the result of a CI job just-in-time, based on previous data from the same pipeline?
  – RQ1.2: What is the relationship between the computational power needed to predict the build outcome just-in-time and the size of the job parameters?
  – RQ1.3: How can the feedback on the most probable failure cause from the just-in-time prediction best be displayed to the developer?
  – RQ1.4: What is the perceived usefulness of just-in-time build outcome predictions for software developers?
  – RQ1.5: At what point in the build process is it no longer useful to continue making predictions?
• RQ2: To what extent can the problems be solved by the potential solutions in (RQ1)?

Each main research question defines one cycle; the thesis therefore features two cycles in total. The first cycle explores potential solutions for addressing the identified problems. The second cycle then evaluates how effectively the solutions proposed in the first cycle resolve the issues identified there. The subsequent sections describe the handling of each research question in more detail: Section 4.1 addresses RQ1, while the following section tackles RQ2.

4.1 Solution
In this thesis, the solution cycle was broken down into smaller iterations, each aimed at generating a distinct contribution to the artifact. The findings encompassed various evaluated combinations of classifiers, tokenizers, models, and hyperparameters. Their performance was evaluated in terms of accuracy, MCC, and required computational resources, followed by analysis of the gathered information. The insights gleaned from analyzing the findings fed into the subsequent iteration, thereby strengthening the iterative process.

Each iteration commenced with collaborative planning between the academic and industrial supervisors to outline the tasks to be undertaken. These plans were flexible, serving as rough guidelines rather than rigid directives. Adjustments were made as necessary, particularly based on the outcomes of earlier evaluations within the iteration, which could influence subsequent stages. The second part of each iteration consisted of building a prototype and evaluating different combinations for the artifact.

4.1.1 Machine learning model
The first part of the first cycle involved evaluating which ML models could best be used for CI build outcome prediction while the CI job is running.
With ML there is a plethora of parameters that can affect the result, so it is essential to evaluate different configurations to obtain the best performance. The different combinations comprised various methods for improving the result, such as parameter tuning, data handling, and data balancing. The dataset for this study consisted of logs sourced from Zenseact's CI system, predominantly containing data related to the development of safety system software for automotive applications.

The system developed initially operated by employing a downloader capable of retrieving log files saved upon the completion of each real CI job. The downloaded metadata included the log file itself along with its corresponding status, denoting either a pass or a fail outcome. This data served as input for the models utilized in the prediction system. The process was structured as in the following list and is connected to the flowchart in Figure 4.1; each number in the flowchart corresponds to an item in the list (an illustrative code sketch of this loop is shown further below).

1. One year of job logs is downloaded from the specified job.
2. The downloaded CI job data is loaded; depending on the balancing used, different data is loaded.
3. The data is preprocessed and unnecessary elements, for example timestamps, are removed from the logs if necessary.
4. The preprocessed data is tokenized with the tokenizer chosen for that build.
5. Lastly, in the training phase, the tokenized data is handed to a classifier, which returns the trained classifier.
6. In the first step of simulating a pipeline prediction, test data is loaded in chunks based on the number of lines to predict on for each prediction. This data goes through the same steps as 3 and 4 before being handed to the classifier returned in step 5.
7. The classifier returns 0 if it predicts the job will fail, and 1 if it predicts the job will succeed.
8. The predictions are evaluated against the real outcome of the job.

Figure 4.1: Flowchart of the process described in the list above.

The concluding phase of each iteration involved analyzing the artifact and determining the course of action for the subsequent iteration, a process facilitated by collaboration with both the industrial and academic supervisors. Each new finding served as a means to identify shortcomings and areas for improvement. Subsequently, planning for the next iteration commenced, with the primary aim of addressing the identified shortcomings. Given the time-intensive nature of evaluating each combination, it was imperative to prioritize those combinations likely to yield meaningful data. Consequently, numerous combinations had to be discarded, resulting in only a few combinations being carried forward to the next iteration.

4.1.2 Interface for developers
Numerous methods exist for CI systems to deliver feedback to developers, and to determine the most effective ones according to developer preference, multiple designs and approaches for providing feedback from predictions were compared. Two distinct approaches were employed in selecting which solutions to present to developers. The first involved an examination of the current feedback mechanisms utilized at Zenseact within their CI system. The second approach leveraged prior studies and the methodologies employed therein.
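Returning briefly to the training-and-simulation loop of Section 4.1.1, the following is a minimal, illustrative sketch of how such a loop could look. All helper names are hypothetical, and TF-IDF with a random forest is used purely as a stand-in for the thesis's actual tokenizers and classifiers:

```python
# Illustrative sketch of the loop in Figure 4.1; not the actual Zenseact implementation.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

def preprocess(log_text):
    # step 3: strip elements such as leading timestamps (simplified)
    return "\n".join(line.split(" ", 1)[-1] for line in log_text.splitlines())

def train(logs, outcomes):
    # steps 4-5: tokenize/vectorize the cleaned logs and fit a classifier
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(preprocess(l) for l in logs)
    clf = RandomForestClassifier(random_state=0).fit(X, outcomes)
    return vectorizer, clf

def simulate(vectorizer, clf, running_log, lines_per_step=50):
    # steps 6-7: predict repeatedly on the log as it grows, one chunk at a time
    lines = preprocess(running_log).splitlines()
    predictions = []
    for end in range(lines_per_step, len(lines) + 1, lines_per_step):
        X = vectorizer.transform(["\n".join(lines[:end])])
        predictions.append(int(clf.predict(X)[0]))  # 0 = predicted fail, 1 = pass
    return predictions  # step 8: compare against the job's real outcome

if __name__ == "__main__":
    logs = ["12:00 step one ok\n12:01 tests passed", "12:00 step one ok\n12:01 compile error"]
    outcomes = [1, 0]
    vec, clf = train(logs, outcomes)
    print(simulate(vec, clf, logs[1], lines_per_step=1))
```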
The adoption of these two approaches resulted in the creation of several mockups, available in the interview guide in Appendix A, which were subsequently evaluated through interviews with developers at Zenseact. This process aimed to ascertain whether a consensus exists regarding the preferred approach or whether developers hold differing opinions on the matter.

The selection of interviewees followed purposeful sampling, involving the identification of participants with expertise or experience in software development who were also available and willing to engage, with an exemption for CI developers. Each interview spanned approximately 30 to 50 minutes, during which consent for data collection was obtained and measures to maintain confidentiality were assured. Originally, the interviews were intended to involve exclusively developers working on the primary product, specifically self-driving technology. In the end, however, one scrum master and one product owner were added to gain a broader view, as they knew how the teams operate. All participants and their prior experience can be found in Table 4.1.

Table 4.1: Interview participants, their roles, and prior experience.

Name           Role            Years at Zenseact   Years as CI developer
Interviewee A  Developer       1.5                 No
Interviewee B  Developer       7                   No
Interviewee C  Developer       3.5                 0.5
Interviewee D  Scrum master    3                   2
Interviewee E  Developer       2                   No
Interviewee F  Product owner   7                   0.5

4.2 Evaluation
To evaluate the gathered data, the thesis has employed a combination of qualitative and quantitative evaluation approaches. Research questions RQ1.1 and RQ1.2 have been analysed with the help of computational experiments and a quantitative evaluation approach. To answer RQ1.3, RQ1.4, and RQ1.5, a qualitative approach has been employed, using the interviews as the basis. RQ2 has subsequently been examined using both qualitative and quantitative evaluation approaches.

4.2.1 Computational experiments
During the analysis phase of each iteration of building the machine learning model, hypothesis testing has been used. In some of the iterations, time series analysis has also been used when comparing the results from simulating a CI build. The u