Automation and Orchestration for Machine Learning Pipelines A Study of Machine Learning Scaling: Exploring Micro-service architecture with Kubernetes Master’s thesis in Complex Adaptive Systems FILIP MELBERG VASILIKI KOSTARA DEPARTMENT OF PHYSICS CHALMERS UNIVERSITY OF TECHNOLOGY Gothenburg, Sweden 2024 www.chalmers.se Master’s thesis 2024 Automation and Orchestration for Machine Learning Pipelines A Study of Machine Learning Scaling: Exploring Micro-service architecture with Kubernetes FILIP MELBERG VASILIKI KOSTARA Department of Physics Chalmers University of Technology Gothenburg, Sweden 2024 Automation and Orchestration for Machine Learning Pipelines A Study of Machine Learning Scaling: Exploring Micro-service architecture with Kubernetes FILIP MELBERG, VASILIKI KOSTARA © FILIP MELBERG, VASILIKI KOSTARA, 2024. Supervisors: Hamid Ebadi, Infotiv Technology Development & Giovanni Volpe, Department of Physics Examiner: Giovanni Volpe, Department of Physics Master’s thesis 2024 Department of Physics Chalmers University of Technology SE-412 96 Göteborg Sweden Telephone + 46 (0)31-772 1000 Chalmers digital printing Gothenburg, Sweden 2024 Automation and Orchestration for Machine Learning Pipelines A Study of Machine Learning Scaling: Exploring Micro-service architecture with Kubernetes FILIP MELBERG, VASILIKI KOSTARA Department of Physics Chalmers University of Technology Abstract Although Machine Learning (ML) has been around for many decades, its popularity has grown tremendously in recent years. Today’s requirements show a great need for the development and management of ML projects beyond algorithms and coding. The aim of this thesis is to investigate how a minimal team of engineers can create and maintain an ML pipeline. To this end, we will explore how a Machine Learning Operations (MLOps) pipeline could be created using containerization and container orchestration of micro-services.
After relevant research, the result is a minimal, on-premises Kubernetes cluster set up on physical servers and Virtual Machines (VMs) running the Ubuntu Operating System (OS). The cluster consists of a master and two worker nodes, which are used for two main ML frameworks. Populating the cluster with more nodes is straightforward, which makes scaling a simple task. Additionally, a locally shared folder on the network is mounted in the cluster as external storage, and the cluster is configured to access either a local or a cloud-provided container registry. Once the cluster is set up and running, an application is launched to train the YOLOv5 model on a custom dataset. Later, Distributed Data Parallel (DDP) training is performed on the cluster using PyTorch, TorchX, PyTorch Lightning and Volcano. Keywords: DevOps, Docker, Kubernetes, Micro-service, ML, MLOps, PyTorch, YOLO Acknowledgements The research for this master’s thesis was carried out within the SMILE IV project, financed by Vinnova, FFI, Fordonsstrategisk forskning och innovation, under grant number 2023-00789 [75]. We would like to express our deepest gratitude to our project supervisor Dr. Hamid Ebadi, Senior Researcher and Competence Leader at Infotiv Technology Development, for his invaluable guidance and his constant engagement in our thesis project. We would also like to extend our gratitude to Maria Kindmark Alemyr, Consultant Manager at Infotiv Technology Development, as well as the company personnel, for providing valuable advice and access to crucial resources, such as software tools and technical infrastructure. Finally, we want to thank our supervisor and examiner at Chalmers University of Technology, Giovanni Volpe, Senior Lecturer at the Department of Physics at the University of Gothenburg, for his direction concerning administrative processes.
Filip Melberg and Vasiliki Kostara, Gothenburg, June 2024 vii viii Contents List of Figures x List of Tables xii List of Abbreviations xiii 1 Introduction 1 2 Background 5 2.1 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2 DevOps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.3 MLOps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.4 Containerization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.4.1 Container Runtimes . . . . . . . . . . . . . . . . . . . . . . 12 2.4.2 Container Runtime Interface . . . . . . . . . . . . . . . . . . 12 2.4.3 Container Registry . . . . . . . . . . . . . . . . . . . . . . . 13 2.5 Container Orchestration . . . . . . . . . . . . . . . . . . . . . . . . 14 2.5.1 Kubernetes (K8s) . . . . . . . . . . . . . . . . . . . . . . . . 14 2.6 Miscellaneous . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.6.1 Version Control (Git) . . . . . . . . . . . . . . . . . . . . . . 22 2.6.2 Virtualization (VirtualBox) . . . . . . . . . . . . . . . . . . 23 2.6.3 File Sharing (Samba) . . . . . . . . . . . . . . . . . . . . . . 23 3 Methods 25 3.1 High Level Requirements . . . . . . . . . . . . . . . . . . . . . . . . 25 3.2 Simple CI Pipeline on GitLab using CML . . . . . . . . . . . . . . 26 3.3 Kubernetes pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.3.1 Infrastructure Overview: Servers and Network . . . . . . . . 27 3.3.2 Requirements for Setting up a Cluster with kubeadm . . . . 28 3.3.3 Setting up a Cluster with kubeadm . . . . . . . . . . . . . . 30 3.3.4 Crafting the First Pipeline Component . . . . . . . . . . . . 31 ix 3.3.5 Creating External Storage . . . . . . . . . . . . . . . . . . . 32 3.3.6 Configuring Container Registry . . . . . . . . . . . . . . . . 33 3.3.7 Running the First Pipeline Components in a Kubernetes Cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
33 3.3.8 Kubernetes Dashboard . . . . . . . . . . . . . . . . . . . . . 34 3.3.9 Distributed Training in a Kubernetes Cluster . . . . . . . . 34 4 Results 37 4.1 Simple CI Pipeline on GitLab using CML . . . . . . . . . . . . . . 37 4.2 The Finalized Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . 38 4.2.1 Object Detection Training and Testing . . . . . . . . . . . . 39 4.2.2 Distributed Data Parallel Training . . . . . . . . . . . . . . 39 5 Discussion 43 6 Conclusion 47 A Source Code Repositories I A.1 Thesis Code Repository . . . . . . . . . . . . . . . . . . . . . . . . I A.2 Code Repository for SMILE-IV . . . . . . . . . . . . . . . . . . . . I B CML Pipeline III B.1 .gitlab-ci.yml . . . . . . . . . . . . . . . . . . . . . . . . . . . . III x https://github.com/infotiv-research/Kubernetes-MLOps/ https://github.com/infotiv-research/infobot/ List of Figures 1.1 Components of an ML project according to [20] and [51]. . . . . . . 2 1.2 Intersections between ML/MLOps engineers and contributing roles as illustrated by Kreuzberger, D., Kühl, N., and Hirschl, S. [29] . . 3 2.1 During the DDP process the dataset is split and each partition is processed by a separate worker. The workers compute gradients which are synchronized in a main server [41]. . . . . . . . . . . . . . 7 2.2 During synchronous updating the main server waits for all workers to complete their calculations [41]. . . . . . . . . . . . . . . . . . . . 8 2.3 When updating asynchronously the main server updates the model instantly when receiving a computed gradient [41]. . . . . . . . . . . 9 2.4 The figure illustrates different DevOps steps in an infinite loop [10]. 10 2.5 ML lifecycle as illustrated by Neil Analytics [35]. . . . . . . . . . . . 11 2.6 MLOps implementation principles and components as illustrated by Kreuzberger, D., Kühl, N., and Hirschl, S. [29]. . . . . . . . . . . . 
11 2.7 The kubelet (Kubernetes node agent) is a component that communicates with the container runtime through the Container Runtime Interface (CRI) [11]. . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.8 Figure illustrating a Kubernetes cluster with four nodes and their main components (one control plane and three workers) [44]. . . . 15 2.9 Figure illustrating Kubernetes pods. The single container pod (Pod 1) is the most common, although pods can contain multiple containers and volumes. The illustration is inspired by [69]. . . . . . . . . . 16 2.10 Figure illustrating two Kubernetes services, "web service" and "auth service". Groups of pods are exposed on the network through services [21]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.11 The external storage is outside of the cluster. It is defined in the cluster with a PersistentVolume (pv) and then a PVclaim (pvc) is bound to it. The PVclaim defines how much storage the pods can consume [5]. . . . . . . . . . . . . . . . . . . . . . . . 17 2.12 Volcano works on top of Kubernetes [76]. . . . . . . . . . . . . . . . 22 2.13 A type of virtualization. . . . . . . . . . . . . . . . . . . . . . . . . 23 3.1 Network topology diagram. . . . . . . . . . . . . . . . . . . . . . . . 27 4.1 CI pipeline logs on GitLab using the simple CML model. . . . . . . 37 4.2 Confusion matrix depicting the performance of the simple CML model, saved as a CI artifact. . . . . . . . . . . . . . . . . . . . . . 38 4.3 Parts of the pipeline used for object detection. . . . . . . . . . . . . 39 4.4 Parts of the pipeline used for DDP training. . . . . . . . . . . . . . 40 4.5 It took almost an hour to train on the MNIST dataset with one worker. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.6 Training on two workers significantly reduced the training time to approximately 3 minutes. . . . . . . . . . . . . . . . . . . . . . . . .
41 List of Tables 3.1 VM specifications. . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 List of Abbreviations AI Artificial Intelligence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 API Application Programming Interface . . . . . . . . . . . . . . . . . . . . 19 AWS Amazon Web Services . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 CI Continuous Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 CI/CD Continuous Integration/Continuous Deployment . . . . . . . . . . . 1 CML Continuous Machine Learning . . . . . . . . . . . . . . . . . . . . . . 5 CNN Convolutional Neural Network . . . . . . . . . . . . . . . . . . . . . . 6 CPU Central Processing Unit . . . . . . . . . . . . . . . . . . . . . . . . . . 28 CRI Container Runtime Interface . . . . . . . . . . . . . . . . . . . . . . . xi DDP Distributed Data Parallel . . . . . . . . . . . . . . . . . . . . . . . . . v DevOps Development and Operations . . . . . . . . . . . . . . . . . . . . . 1 DHCP Dynamic Host Configuration Protocol . . . . . . . . . . . . . . . . . 28 GKE Google's Kubernetes Engine . . . . . . . . . . . . . . . . . . . . . . . 44 GPU Graphics Processing Unit . . . . . . . . . . . . . . . . . . . . . . . . . 27 GUI Graphical User Interface . . . . . . . . . . . . . . . . . . . . . . . . . . 21 IP Internet Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 K8s Kubernetes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 LAN Local Area Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 LLM Large Language Model . . . . . . . . . . . . . . . . . . . . . . . . . . 5 MAC Media Access Control . . . . . . . . . . . . . . . . . . . . . . . . . . 28 ML Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v MLOps Machine Learning Operations . . . . . . . . . . . . . . . . . . . . . v NetBIOS Network Basic Input/Output System . . . . . . . . . . . . . . . .
32 NN Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 OS Operating System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v RAM Random-access Memory . . . . . . . . . . . . . . . . . . . . . . . . . 17 SCM Source Code Management . . . . . . . . . . . . . . . . . . . . . . . . 22 SPPF Spatial Pyramid Pooling Fast . . . . . . . . . . . . . . . . . . . . . . 6 TCP/IP Transmission Control Protocol/Internet Protocol . . . . . . . . . . 24 VM Virtual Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v YOLO You Only Look Once . . . . . . . . . . . . . . . . . . . . . . . . . . 6 Chapter 1 Introduction The use of Artificial Intelligence (AI) by professionals and communities is becoming increasingly prevalent, encompassing a wide range of applications, from complex challenges to everyday entertainment purposes. A study conducted by Stack Overflow based on their 2023 developer survey [54] revealed that all types of professional developers are either planning to use or are already utilizing AI tools in their projects. Furthermore, the overall sentiment towards AI is overwhelmingly positive [38]. With the rapid growth of AI, however, an important drawback is maintenance and evolution over time. Sculley et al. [51] discuss how the realization of ML projects incurs technical debt for various reasons, aiming to raise awareness about the commitments and best practices that ML projects require. These issues arise because developing and operating ML systems involves far more than writing ML code. In fact, according to the same publication, only a small fraction of a typical ML project consists of ML code, marked by the dark box in Figure 1.1, while the rest of the components can be quite vast and complex.
Instead, as the representation in Figure 1.2 suggests, an ML project may actually require the adoption of many roles with different skills, such as data science and Development and Operations (DevOps) engineering, among others [29]. Therefore, MLOps is a set of practices that strives to narrow the gap between these roles and combine the skills an ML project requires. MLOps is similar in principle to DevOps, with a focus on ML. It incorporates automation and monitoring of ML pipelines, often end-to-end, and helps optimize workflows and minimize errors [47]. MLOps is, however, a fairly young discipline and still faces limitations that the long-established field of DevOps has overcome. Currently, many software development researchers and engineers are executing Continuous Integration/Continuous Deployment (CI/CD) processes manually, including data analysis, model training and validation [20]. This approach is prone to inefficiencies, errors and delays in an ML pipeline, such as varying performance Figure 1.1: Components of an ML project according to [20] and [51]. during training and deployment. In response to these issues, we proposed the use of DevOps and MLOps as a robust, systematic and automated approach to managing the life cycle of an ML project. By including containerized training and testing stages in an ML pipeline, we implemented the foundations of a reliable, seamless process that can be further developed and enhanced in the future. Inspired by the vast variety of options, we researched literature and software solutions that are applicable in both an academic and a corporate environment. Our main implementation includes a pipeline with containerized applications using Git, Docker and Kubernetes. The use of these popular open source DevOps tools ensures the general applicability of our work and its potential benefit to future implementations.
To determine which methods would be suitable for our thesis, we also experimented with relevant open-source ML projects. Although successfully configuring and containerizing an ML project can be time-consuming and challenging, it is ultimately worth the effort. The main goal of this thesis is to learn and engineer the parts of an ML project that do not involve developing algorithms and code (see Figure 1.1). Instead, we wish to gain a global overview of what a full ML project can entail and gain experience assuming different roles (see Figure 1.2). With this objective, we explored the question: Is a small team of engineers sufficient to achieve a fully functional end-to-end ML pipeline and, if so, what are the high-level criteria it should fulfill? This report is organized into five further chapters. Chapter 2 discusses primary definitions and the theoretical background of material relevant to this thesis, followed by Chapter 3, which presents the realization of this material. Afterwards, Chapter 4 shows the results of our methodologies, while Chapter 5 discusses the limitations encountered during the thesis and suggests possible material for future consideration. Finally, Chapter 6 provides a conclusion to this thesis report. Figure 1.2: Intersections between ML/MLOps engineers and contributing roles as illustrated by Kreuzberger, D., Kühl, N., and Hirschl, S. [29] Chapter 2 Background In this chapter, definitions and theoretical background are provided regarding the principles and software relevant to the scope of this thesis. In particular, this chapter mainly discusses MLOps and DevOps, VMs, containers and their orchestration, with particular focus on Kubernetes, and, finally, file sharing. 2.1 Machine Learning ML constitutes the largest subset of AI. Inspired by human learning processes, ML algorithms have become a key component of many solutions in both academic research and corporate applications.
From Large Language Models (LLMs) to object detection and many more applications, AI projects can employ a wide variety of ML algorithms and Neural Networks (NNs) to achieve results tailored to contemporary demands. What is more, entire online communities, such as Hugging Face [23] and Kaggle [25], are dedicated solely to AI and ML. As a result, open source AI software is readily available from many corporations, while ML libraries and platforms like PyTorch [39] and TensorFlow [1] offer algorithms and code to suit a majority of ML projects. During this thesis, ML work published in online communities often brought inspiration and new knowledge. Initially, a framework called Continuous Machine Learning (CML) was tried, which offers CI/CD integrated with ML. However, the main part of this thesis consists of object detection of forklifts and people in relation to the SMILE IV project [75], which hosts code for a warehouse simulation. As a last interesting and advanced ML practice, DDP training was implemented on a Kubernetes cluster, a software tool that will be discussed in Section 2.5. These parts mainly use PyTorch, as well as other tools, which will be described in more detail below. Continuous Machine Learning (CML) CML is an open source software tool used for implementing CI/CD pipelines in machine learning projects [14]. In particular, CML is a library of functions used inside CI/CD runners to make ML compatible with popular distributed version control platforms. It can be used to automate ML workflows, such as model training and testing, to compare ML models and to monitor changes in datasets. You Only Look Once (YOLO) You Only Look Once (YOLO) is an object detection algorithm that employs deep learning in computer vision and, since it was first introduced, has been developed to offer high speed and accuracy while remaining simple and straightforward.
Specifically, YOLO is a one-stage detector [55] using a Convolutional Neural Network (CNN) architecture, which has been improved with each released version, and it predicts the spatial association of bounding boxes with class probabilities [48]. To date, several versions of YOLO have been released; the fifth version, YOLOv5, is of particular relevance to this thesis. YOLOv5 uses several convolutional layers, including a Spatial Pyramid Pooling Fast (SPPF) layer, which ensures computational efficiency and captures information at various scales, followed by upsampling to increase resolution [74]. This fast and powerful version uses PyTorch and is maintained by Ultralytics [18]. Distributed Data Parallel The process of training ML algorithms on several machines is called distributed training and can be done with different methods. In this text the focus will lie on the DDP method, where the data is partitioned among several different workers in a Kubernetes cluster. Common reasons for using distributed training include: • training on enormous datasets that would take too much time to process on a single machine • the data cannot be accessed or stored in full on a single machine • the data is only accessible from different locations • the data must fit entirely in memory to be preprocessed, but is too large for a single machine A typical training iteration may consist of three steps. First, in forward propagation the NN outputs a best guess, and a loss is computed from the predictions and the labels. Second, in backpropagation, gradients of the loss with respect to the weights and thresholds are computed. Finally, an optimizer applies the gradients to update these parameters. What is specific about DDP training [30] is that the gradients are communicated across nodes before the third step. Figure 2.1: During the DDP process the dataset is split and each partition is processed by a separate worker. The workers compute gradients, which are synchronized in a main server [41].
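As a toy illustration of this idea, the per-worker gradient computation and the averaging that takes place before the optimizer step can be simulated in a few lines of plain Python. The function names are our own and this is not the actual PyTorch DDP API; it is only a sketch of the scheme.

```python
# Toy illustration of DDP: each worker computes a gradient on its own data
# partition, the gradients are averaged across workers (the "communication"
# step), and only then does the optimizer update run. Not the PyTorch API.

def local_gradient(weight, batch):
    """Gradient of the mean squared error 0.5*(w*x - y)^2 over one batch."""
    return sum((weight * x - y) * x for x, y in batch) / len(batch)

def ddp_step(weight, partitions, lr=0.1):
    """One DDP iteration: per-worker gradients -> average -> optimizer step."""
    grads = [local_gradient(weight, part) for part in partitions]  # steps 1-2
    avg_grad = sum(grads) / len(grads)                             # all-reduce
    return weight - lr * avg_grad                                  # step 3

if __name__ == "__main__":
    # Two workers, each holding a partition of a dataset (x, y) with y = 2x.
    partitions = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
    w = 0.0
    for _ in range(100):
        w = ddp_step(w, partitions)
    print(round(w, 3))  # converges towards 2.0
```

Because every worker receives the same averaged gradient, all model replicas move in lockstep, which is the property the synchronous scheme guarantees.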
Therefore, all model partitions are updated according to the same trend and rate of change. Finally, a local optimizer is applied to each partition. Specifically, during the DDP process the model to be trained is first initialized on a main server. Each Kubernetes worker is responsible for a partition of the training data and downloads a copy of the model from the main server. At the worker nodes, gradients of the loss function are calculated on the batches of training data for which they are responsible. The calculated gradients are then sent to the main server, where the model is updated. The model update can happen either synchronously or asynchronously. During synchronous updates the main server waits for all workers to finish calculating and sending their gradients. This may result in slower training because the process is limited by the slowest worker. By updating asynchronously, the main server updates the model as soon as it receives a gradient from a worker, eliminating the limitation imposed by the slowest worker. However, this method comes with its own trade-offs in the form of potentially worse convergence due to race conditions [41]. Race conditions are flaws that occur due to dependence on the timing or sequence of events in a program, leading to various unwanted consequences [49]. In the context of DDP training, a potential race condition can occur when a Kubernetes worker performs an update based on an obsolete model and overwrites the updates of more recent models computed on other workers. This can result in issues with model performance and convergence. Figure 2.2: During synchronous updating the main server waits for all workers to complete their calculations [41]. Figure 2.3: When updating asynchronously the main server updates the model instantly when receiving a computed gradient [41]. 2.2 DevOps Applications, after they are created, need to be made available for clients to use.
This involves several steps that can be categorized as either development or operations. During development the first step could be to plan and define an application’s purpose as well as its functionality. The next step is to code and create the actual application, followed by building and testing to ensure that it functions as intended. However, making the application available to customers usually involves going through several steps to deploy it in a different environment, such as a Linux server, and then making sure that it runs without problems. These steps are part of operations and can include preparing servers by installing necessary tools and packages, opening ports, performing network configuration, and monitoring the application by examining user data. Figure 2.4: The figure illustrates different DevOps steps in an infinite loop [10]. DevOps attempts to merge development and operations to increase the speed of delivering high-quality code, rapidly update deployments and improve collaboration and overall product quality. DevOps is hard to define but is often said to be a set of practices, tools and a cultural philosophy. At the center of DevOps is CI/CD, which is about automating all the required steps between planning a new application and deploying it [6]. 2.3 MLOps MLOps is the process of automating ML using DevOps methodologies, according to Noah Gift and Alfredo Deza [17]. In practice, MLOps combines ML application development with system deployment and operations to create standardized and automated processes along the ML lifecycle, depicted in Figure 2.5. Figure 2.5: ML lifecycle as illustrated by Neil Analytics [35]. Adopting this approach, a completed framework requires technical and management principles that aim to increase development velocity and the creation of high-quality software [17], alongside a strong consideration of best practices. Figure 2.6 depicts these principles linked to fundamental components of MLOps.
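As a minimal, concrete illustration of such automation, a training job can be wired into a CI pipeline. The sketch below is a hypothetical GitLab CI job using CML; the image tag, script names (`train.py`, `metrics.txt`) and report contents are our own assumptions, and the pipeline actually used in this work is listed in Appendix B.1.

```yaml
# Hypothetical GitLab CI job automating model training with CML.
# Script and file names are illustrative placeholders.
train-and-report:
  image: iterativeai/cml:0-dvc2-base1      # CML container image
  script:
    - pip install -r requirements.txt
    - python train.py                       # writes metrics.txt and plots
    - echo "## Training report" > report.md
    - cat metrics.txt >> report.md
    - cml comment create report.md          # posts the report on the commit
```

On every push, the runner retrains the model and publishes a report, so training and validation are no longer manual steps.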
Figure 2.6: MLOps implementation principles and components as illustrated by Kreuzberger, D., Kühl, N., and Hirschl, S. [29]. 2.4 Containerization Containerization is the process of encapsulating applications and their dependencies into a single container that can easily be shipped to other systems [32]. In contrast to VMs, containers share the host kernel and OS and do not require separate virtual ones [33]. The concept relies on building an image from the application and its dependencies and then pushing it to a remote or local image registry. This image can later be pulled from the registry and used to start the application on another system seamlessly. To automate the build process, a text file named Dockerfile that describes the process can be created [32]. Applications inside containers can act as micro-services in a larger framework, which are relatively simple to debug and replace. When it comes to containerization, the most popular and lightweight open-source platform is Docker [13], ranking as the top tool in the Stack Overflow Annual Developer Survey for 2023 [54]. In the scope of this thesis, a Dockerfile will be used as a recipe to build an image for a containerized ML application. The application is then managed by a Kubernetes cluster and registered in either Docker Hub or a local container registry. 2.4.1 Container Runtimes The container runtime is the software that is responsible for running containers on the host system. It is able to pull images from a registry, from which it then starts containers. In more technical terms, a container runtime limits, accounts for and isolates system resources for a collection of processes. Containers operate on top of the operating system, and any virtual machines, and are thus independent of the system hardware.
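To make the Dockerfile concept from Section 2.4 concrete, a minimal image for a containerized training application might look as follows; the base image, file names and entry point are illustrative assumptions rather than the exact ones used in this thesis.

```dockerfile
# Illustrative Dockerfile for a containerized ML application.
FROM python:3.10-slim

WORKDIR /app

# Install dependencies first, so this layer is cached between rebuilds.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code and define how the container starts.
COPY train.py .
CMD ["python", "train.py"]
```

Such an image is built with `docker build -t trainer:latest .` and pushed to a registry with `docker push`, after which any node with a container runtime can pull and run it.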
Docker offers one of the most famous container runtimes and is widely used by different organizations to enable applications to run in different environments, on different hosts and with minimal effort [36]. In this project the Containerd [11] runtime will be used instead of the Docker runtime. The container execution process consists of several steps. First, the runtime creates the container and initializes its environment from an image containing the dependencies and the actual application to be run. Next, the runtime starts the container and ensures that the application is running inside. Afterwards, the container runtime keeps monitoring the container and ensures the application is working by restarting the container if it fails. To isolate the container, the runtime utilizes namespaces and cgroups (control groups) and ensures that processes inside containers do not disrupt host or other container processes [36]. 2.4.2 Container Runtime Interface A CRI is an interface that enables the Kubernetes cluster to communicate with the container runtime, for instance Docker or Containerd, since it cannot directly operate the container runtime. Kubernetes, and its components, will be explained in Section 2.5. Some examples of container runtimes that provide a CRI for Kubernetes include Mirantis Container Runtime, CRI-O and Containerd. At the time of writing, Docker Engine needs the cri-dockerd adapter in order to work with Kubernetes. Figure 2.7 illustrates an example where the kubelet manages containers by calling the Containerd runtime, through the CRI, which in turn will start up the specified number of containers. Figure 2.7: The kubelet (Kubernetes node agent) is a component that communicates with the container runtime through the CRI [11]. 2.4.3 Container Registry An important feature of containerization is the registry, where one can push and pull their own images [32].
In other words, container registries store any images relevant to a project and can be maintained either locally or remotely. In the context of container orchestration, and specifically Kubernetes, a container registry is not necessarily part of the cluster. However, configuring communication between a cluster and a registry is a vital component of any containerized project, mainly involving the creation of a Kubernetes secret [66]. These practices will be analyzed more thoroughly within this chapter, as well as in Methods (Chapter 3). Private Registry Setting up a private container registry can offer significant control over a developing project. Firstly, this solution offers full control over storing and accessing container images, which can prove useful for those who want to develop private projects instead of hosting them publicly. Additionally, developers can set up and configure their private registry in full control and according to project requirements. A possible drawback of a private container registry, as with every private project, is the requirement for more comprehensive knowledge and experience in this particular matter. Docker Hub Docker Hub is a public container registry platform for finding and sharing Docker images [12]. It is professionally developed and maintained, while also offering many pre-built and community-provided images that can be used instead of developing from the ground up. However, this solution is unsuitable for projects intended to be developed internally. 2.5 Container Orchestration Managing large numbers of containers can be difficult. Container orchestrators are designed to handle thousands of containers, automating tasks like upgrading, starting, stopping, monitoring and scaling. They perform important container life cycle tasks quickly and with little effort.
Container orchestration is based on declarative programming: rather than specifying the steps towards the desired output, only the output is defined and the tool then achieves it automatically. Container orchestrators can be self-built or managed, with this document focusing on the former and its challenges [19]. 2.5.1 Kubernetes (K8s) Kubernetes (K8s) is a versatile open source platform for automating, scaling, managing and deploying containerized applications [62]. It provides services, support and tools and ranks high in popularity and usage by professionals according to the Stack Overflow Annual Developer Survey for 2023 [54]. The Kubernetes container orchestrator manifests itself in the form of a cluster that uses its own abstractions to manage micro-services. A cluster consists of a control plane (master node) and worker nodes, which are governed by the control plane. Each cluster has one main control plane that can have multiple copies for redundancy in case the main one goes down. Should a worker node fail, the control plane will automatically reschedule all processes on another available node, guaranteeing that applications keep running. Applications are managed through pods. A pod is a Kubernetes abstraction, which means that it has no meaning by itself and can instead be loosely seen as a reference to a container in the most common cases. The container is the actual object that contains the running application, while the pod acts as a wrapper around it. In the context of Kubernetes, all applications are put in pods, which in turn are scheduled onto nodes. Every pod has its own internal Internet Protocol (IP) address that is used to communicate with it from within the cluster. Pods are the smallest units that can be deployed on the cluster and they are ephemeral, Figure 2.8: Figure illustrating a Kubernetes cluster with four nodes and their main components (one control plane and three workers).
Being ephemeral means that data should never be stored permanently in pods, since it is lost if a pod goes down [44]. In more advanced use cases one pod can contain a group of tightly coupled containers that share namespaces and filesystem volumes. One such example is a pod consisting of a main application container and a sidecar container. The sidecar container is a secondary container that runs in the same pod as the main container and can, for instance, display or send logs from the main application container [65].

Since every pod has its own unique internal IP address, it is inconvenient to use it in applications that access pods. Every time a container is restarted it gets placed in a new pod with a new IP address and can thus no longer be accessed through the old address. Problems also arise when trying to access several pods in a group that comprise a network application, or if a single-pod application needs several copies of itself. For these reasons Kubernetes uses Services, which are abstractions that can be thought of as labels (for groups of pods) with internal static IP addresses. Other applications are configured to access the service rather than the pod(s), thus making them independent of the number of copies or actual pods.

Figure 2.9: Figure illustrating Kubernetes pods. The single-container pod (Pod 1) is the most common, although pods can contain multiple containers and volumes. The illustration is inspired by [69].

Figure 2.10: Figure illustrating two Kubernetes services, ”web service” and ”auth service”. Groups of pods are exposed on the network through services [21].

Kubernetes clusters are not meant to store any data permanently. Instead, external storage spaces are mounted into the cluster. These could be anything from cloud storage or a storage cluster to a folder on a local computer. Kubernetes uses Persistent Volumes to assign storage that the cluster is allowed to use.
Persistent Volumes are cluster resources just as nodes are, according to the Kubernetes docs; however, it is important to note that they are abstractions and not actual storage volumes, so they need to be mapped to a real resource. After defining a storage in the cluster through a PersistentVolume, parts of the storage can be requested with PersistentVolumeClaims, which define how much of the storage a pod can use as well as access modes. Applied PersistentVolumeClaims will automatically find suitable PersistentVolumes in the cluster and bind to them. According to the Kubernetes docs, pods can be seen as consuming node resources, while PersistentVolumeClaims consume PersistentVolume resources [70].

Figure 2.11: The external storage is outside of the cluster. It is defined in the cluster with a PersistentVolume (pv) and then a PVclaim (pvc) is bound to it. The PVclaim defines how much storage the pods can consume. [5]

Essential concepts and components related to a functional cluster

This subsection will go into some important components that play vital roles in a Kubernetes cluster.

• Swap Memory
Swap is a preconfigured space on the hard disk, where a page of Random-access Memory (RAM) is copied [31]. However, according to the official documentation an important requirement for kubeadm is the deactivation of swap space [60], which has created debate in Kubernetes communities in the past. Even though beta support for swap memory on Linux has been available since 2023, according to the documentation of the same release, activated swap can pose a risk to environments subject to performance constraints [22]. Therefore, this work follows the official protocol by deactivating swap every time kubeadm is used.

• Nodes
A Kubernetes node is a physical machine (server/computer) or virtual machine that hosts pods. All nodes have a kubelet, a container runtime and a kube-proxy. Nodes can be categorized, for instance into CPU-intensive or memory-intensive node groups.
Nodes are never supposed to save any type of data internally, since it will be lost if the node fails.

• Pod Network Add-ons
A Kubernetes pod network add-on can provide different functionalities; however, as a minimum, it allows the nodes in a Kubernetes cluster to communicate with each other. Pod network add-ons route traffic between hosts but can also facilitate load balancing and service discovery (automatically detecting devices on a network) [7]. This project uses Flannel [16], which is a simple pod network add-on that focuses on handling traffic between nodes and not between individual pods. It is a layer 3 IPv4 network that provides temporary IP addresses (subnet leases) to all the nodes in the cluster, which eliminates the need to configure ports. Another example of a more advanced pod network add-on that also provides network policy is Calico [72].

• Kubelet
This is an important cluster component that acts as a node agent with several key responsibilities. The kubelet runs on all the nodes and executes commands from the control plane, serving as the link between the worker nodes and the master node. It is fundamentally responsible for managing nodes, making sure they are healthy and running, and executing pods [4]. The kubelet can loosely be imagined as the ”brain” of each node that invokes the use of different ”limbs”, such as the container runtime for starting containers.

• etcd
To be able to track the different states of a cluster, Kubernetes implements the etcd key-value storage system. It stays consistent thanks to the Raft Algorithm [45] which, in short, elects a leader node that holds a value agreed upon by the rest of the cluster. The keys are usually resources while the values represent states. In the context of Kubernetes, etcd stores information such as current cluster state, desired cluster state, configuration resources and runtime data.
This allows the control plane to make appropriate adjustments, based on the information in etcd, to ensure that nodes are not overloaded and that tasks get scheduled on nodes with the required resources [3].

• kubeadm
Kubeadm is a bootstrapping tool built to start a minimum viable Kubernetes cluster according to best practices. It starts the kubelet service and automatically deploys the required pods on the control plane when running kubeadm init. For easy cluster resetting and joining of worker nodes, the kubeadm reset and kubeadm join commands are provided. Kubeadm does not install ”nice to have” features such as the Kubernetes dashboard [60].

• kubectl
To interact with the Kubernetes control plane, a command line tool called kubectl can be installed. Everything from querying the number of nodes or pods to starting new jobs or deployments can easily be done with kubectl. The tool will by default look for a configuration file in the ∼/.kube/ directory that contains information on where the control plane is hosted and how it can be accessed. All cluster communication is conveniently done with kubectl [57].

• Kube Controller Manager
Kubernetes uses a collection of different controllers to monitor the state of the cluster. These are essentially infinite loops that watch for changes and attempt to bring the current state of an object to the desired state. One example is the node controller, which ensures that nodes are healthy and, if not, moves their workload to another node. Another example is the replication controller, which monitors the number of pod replicas and attempts to always keep the desired number of pod replicas running. There are many more controllers in Kubernetes, and all of the core controllers are embedded in the kube-controller-manager, which is a single binary [73].

• Cloud Controller Manager
If you are using cloud services in your Kubernetes cluster, the cloud provider will require its own logic, for instance when setting up a node.
The cloud-controller-manager embeds cloud-specific control logic and essentially links your cluster to the cloud provider's API [56].

• Kube API Server
The kube-apiserver exposes the Kubernetes Application Programming Interface (API) to end users and to the different parts of the cluster that use the API. Operations on the cluster are performed through the Kubernetes API, and tools like kubectl and kubeadm use it [68] [63].

• Kube Scheduler
This is a process running on the control plane that is responsible for matching pods with nodes. Newly created pods are initially unscheduled, so the kube-scheduler automatically places each pod on an appropriate node, taking into account resource requirements, node workload and more. The kube-scheduler continuously checks for unscheduled pods and can reschedule pods onto other nodes in the case of node failure. The kube-scheduler distinguishes itself from the kubelet in that it only schedules pods, while the kubelet runs the pods [53] [61].

• Kube Proxy
Kube-proxy runs on all the nodes in the cluster and maintains a network routing table that maps service IP addresses to pod IP addresses. When a service is created with a corresponding set of pods, the kube-proxy component will make sure that all traffic incoming to the service is redirected to the pods in the service. The key responsibility of kube-proxy is to watch for changes in Kubernetes cluster services and translate these changes into network rules [27].

• Pods
Pods are the smallest deployable objects in a Kubernetes cluster. They can contain one or several closely related containers, and each pod has its own internal IP address that is lost when the pod goes down.

• Services
Services expose groups of pods on the network and provide a static IP address that can be used to access the pods, and their replicas, connected to the service. They also facilitate pod load balancing.

• Deployments
Deployments ensure that pods and their replicas keep running indefinitely.
If a pod goes down, a new one will be created to keep the desired number of pods always available. An NGINX server is an example of an application that is usually run as a deployment.

• Jobs
Jobs, contrary to deployments, run tasks that are not active indefinitely. A job will keep retrying to execute its pods until a specified number of them successfully terminate or the maximum number of retries is reached. An example application that could be run as a job is ML model training. When the model has been successfully trained, the job finishes and its logs and results are saved.

• Persistent Volumes
Persistent volumes are storage resources in the cluster. They need to be mapped to an actual storage, such as a directory on a computer or a cloud storage.

• Persistent Volume Claims
Persistent volume claims bind to persistent volumes and consume available storage in the cluster. They are used by pods.

• Secrets
These are objects that contain sensitive data. For instance, when trying to access private registries, the log-in credentials can be stored as secrets in the cluster to be used during container creation and image pulling. Secrets are not encrypted by default, and anyone with access to the cluster API consequently has full access to and control of all secrets.

• Token
A bootstrap token, or simply token, is a means of authentication for a Kubernetes cluster. Unless configured or otherwise specified, a token is generated during the initialization of a new cluster and can be used when joining nodes to an existing cluster. A token is a specific type of secret and typically has an expiration date, after which a new token can be generated for relevant authentication processes. The token is a string that must satisfy the regular expression [a-z0-9]{6}.[a-z0-9]{16}, i.e. a string composed of six lowercase alphanumeric characters separated by a dot from sixteen lowercase alphanumeric characters. The first part is considered public information, while the part after the dot is secret.
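The token format described above can be checked programmatically. The following Python sketch (not part of the thesis tooling; the function name is our own) validates a candidate string against the stated regular expression:

```python
import re

# Bootstrap-token format: six lowercase alphanumerics, a dot,
# then sixteen lowercase alphanumerics (per the kubeadm token format).
TOKEN_RE = re.compile(r"[a-z0-9]{6}\.[a-z0-9]{16}")

def is_valid_bootstrap_token(token: str) -> bool:
    """Return True if the string matches the bootstrap-token format exactly."""
    return TOKEN_RE.fullmatch(token) is not None

print(is_valid_bootstrap_token("abcdef.0123456789abcdef"))  # True
print(is_valid_bootstrap_token("abcdef-0123456789abcdef"))  # False: no dot
print(is_valid_bootstrap_token("ABCDEF.0123456789abcdef"))  # False: uppercase
```

Note that the dot must be escaped in the pattern; an unescaped `.` would also match any single character between the two parts.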
Kubernetes Dashboard

The Kubernetes dashboard is a web-based Graphical User Interface (GUI) providing an overview of everything related to a Kubernetes cluster [59]. It offers a user-friendly visual representation of the resources and applications in the cluster.

Volcano and TorchX

Volcano is a high-performance system built on Kubernetes. It provides batch scheduling, which is often useful when running DDP workloads on Kubernetes clusters, and is integrated with popular libraries and frameworks [77], as illustrated in Figure 2.12. Combining PyTorch Lightning, Volcano and TorchX allows for seamless DDP training on Kubernetes. Volcano has the ability to schedule groups of pods (PodGroups), which is required for the job launcher, TorchX [42], to work with Kubernetes. PyTorch Lightning is a lightweight deep learning framework assisting with ML development and scaling [15]. Among other functionalities, it provides the ability to handle device placement and run an ML model on different servers and, when combined with Volcano, it enables distributed training across multiple nodes within a Kubernetes cluster.

Figure 2.12: Volcano works on top of Kubernetes [76].

2.6 Miscellaneous

This section provides a descriptive background of the tools and methods pertinent to this thesis, which contribute to the overall research. Although these topics are either already widely discussed or do not fit neatly into the sections above, they are deemed essential for the effective outcome of this thesis work. These include version control, virtualization and file sharing.

2.6.1 Version Control (Git)

Git is a widely used and powerful version control system; unlike most other version control systems, it stores its data as a series of snapshots of the project rather than as file-based changes [9]. Source Code Management (SCM) ensures code storage and data versioning for any framework, including ML projects. Thus, each version of the code and the data is documented and shared among collaborators [26].
In the context of this thesis, the relevant source code repositories, created with git and hosted on GitHub, are accessible in Appendix A.

2.6.2 Virtualization (VirtualBox)

While having many physical servers can prove computationally powerful, it is also resource-demanding. A popular approach to this issue is virtualization. Virtualization is achieved with a hypervisor, a software layer enabling the deployment of multiple virtual machines, each with its own operating system, on one physical server [33]. A VM is an environment that can host an operating system with libraries and applications managed by the hypervisor. A simple type of this architecture is illustrated in Figure 2.13. In the scope of this thesis VirtualBox, a popular open source and professional virtualization software solution, was chosen [37].

Figure 2.13: A type of virtualization.

2.6.3 File Sharing (Samba)

Samba is the standard interoperability suite for Linux and Unix systems, facilitating seamless integration with Windows environments [8] [71]. It enables communication between different operating systems by establishing a uniform network protocol. Due to its availability and compatibility with many popular open source software systems, Samba is used by numerous corporations and organizations of all kinds, as well as private users [71]. In the scope of this thesis, Samba is used for sharing directories between different servers that coexist on the same network, utilizing standard Transmission Control Protocol/Internet Protocol (TCP/IP) networking.

Chapter 3

Methods

This chapter discusses a comprehensive methodology of the pursued DevOps practices. It includes a thorough explanation of global evaluation criteria and the work built around them, aiming to build a pipeline on physical hardware and virtual machines with containerized features and ML applications.
3.1 High Level Requirements

To integrate ML with DevOps, one must consider the numerous available options and form general selection criteria. To approach this matter, four high-level requirements should be considered.

Automated and Reproducible Work

All processes should involve as little manual interaction as possible, while being reproducible from the ground up. This involves detailed documentation of all practices, including errors and solutions, to increase accuracy and efficiency. The three primary methods utilized for automation during this thesis are Docker containers, shell scripts and configuration files.

Open Source Software

Software selected for this thesis should offer accessibility and community support. Therefore, only open source material is considered, preferably under MIT [34] or Apache [2] licences.

Frameworks for Distributed Training and Load Balancing

Even though it is beneficial to consider single-node training as an initial task, the main goal of this thesis should incorporate distributed training of ML models, since many models are complex or require large datasets. Consequently, it is also deemed necessary to consider software for load balancing computational resources, optimising performance and scalability.

Version Controlling

To enable progress tracking, the usage of Git as a version control system is deemed necessary. As stated in Section 2.6.1, this assists with collaboration and tracks the progress of all ML-related tasks. The repository containing all material relevant to this thesis is listed in Appendix A.

3.2 Simple CI Pipeline on GitLab using CML

To gain familiarity with the concept of CI/CD in the context of DevOps, a simple pipeline can be built on GitLab using an introductory example of CML [24], which is an open source ML tool presented in Section 2.1.
The model in this example is a random forest classifier using the scikit-learn Python library [40], for which the training and testing features and labels are provided in the same repository. Since GitLab provides all of the required infrastructure, only a single CI configuration file, .gitlab-ci.yml, is needed. The contents of .gitlab-ci.yml can be found in Appendix B. The train-and-report job involves only one step, for which the Docker image and the script are provided by iterative.ai [14]. Inside the CML container all the required Python packages are installed via pip before the ML script train.py is run. Upon completion of a pull or merge request in the repository, the pipeline is triggered automatically, presenting logs along the way. After successful completion two artifacts are generated and saved for future reference: a simple text file containing the model accuracy and a plot representing the confusion matrix.

3.3 Kubernetes Pipeline

With an initial comprehension of a practical CI/CD implementation, a reasonable next objective is setting up a Kubernetes cluster to support the DevOps pipeline. This alleviates the constraint of the pipeline residing solely on GitLab.

3.3.1 Infrastructure Overview: Servers and Network

Although Kubernetes is a very useful tool, it requires experience with configuration and setup. A primary infrastructure overview is therefore crucial in understanding a Kubernetes cluster. The network layout can play a major role in resource allocation and load balancing, as well as debugging. Figure 3.1 illustrates a network diagram of the cluster which was set up during this thesis. The central router is the primary networking device and is connected to an external network.

Figure 3.1: Network topology diagram.

Physical Servers

Two available servers, named main and support server in Figure 3.1, are connected to the router via Ethernet cables.
Each server includes one Graphics Processing Unit (GPU), which can be utilized for faster performance in ML projects. The main server is used for hosting virtual machines and the support server acts as a Kubernetes worker node. The chosen OS was Ubuntu Desktop 22.04 LTS on both servers.

Virtual Machines

The main server hosts four virtual machines, as shown in Figure 3.1. As described in Section 2.6.2, they are all deployed on the main server hardware, but each with its own OS. The VMs are managed by VirtualBox 7.0.14 and for each VM the guest OS was chosen to be Ubuntu Server 22.04 LTS. Each VM plays a separate role, presented in Table 3.1, which also shows the chosen specifications. A more detailed description of their usage is provided later in this chapter. Lastly, the selected network attachment is ”Bridged Adapter”, which makes every VM visible in the same network as the physical servers.

Name      Role                       RAM (MB)   Disk Space (GB)   CPUs
master    Kubernetes control plane   8192       40.00             4
worker    Kubernetes worker node     8192       40.00             2
storage   SAMBA storage              8192       50.00             2
registry  container registry         11264      100.00            4

Table 3.1: VM specifications.

3.3.2 Requirements for Setting up a Cluster with kubeadm

Static IP Addresses

Assigning static IP addresses to each server is not a necessary prerequisite; in fact, during most of this thesis work, IP addresses were dynamic. However, to ensure stability, automation and reproducibility, configuring static IPs is a key practice. Each server on the Local Area Network (LAN) in Figure 3.1 is manually assigned a specific IP address that remains constant. The configuration involved using Dynamic Host Configuration Protocol (DHCP) to obtain the Media Access Control (MAC) address and then setting a static IP address for each server. As mentioned above, since every VM network attachment is ”Bridged Adapter”, every IP address is received from the same DHCP server as the physical servers.
Thus, all the physical and virtual servers are available on the same LAN.

Resource Units

Some errors when initializing a Kubernetes cluster, especially in VMs, can occur due to resource shortages. For example, a common mistake is to set the number of Central Processing Units (CPUs) to 1 instead of 2, which produces the error below. This can easily be avoided by setting up VMs that satisfy the lower resource limits set by the official documentation [67].

[ERROR NumCPU]: the number of available CPUs 1 is less than the required 2

CRI Configuration

Before starting the control plane, a CRI needs to be selected, downloaded, installed and configured. After experimenting with cri-dockerd and containerd, the latter was chosen as the main CRI in the pipeline, as it seemed to be preferred in the official Kubernetes documentation [64]. During this process, two bugs were encountered. In the beginning, the control plane failed to initialize and instead returned a ”Kubelet is not running” error. In an attempt to resolve this, system logs were studied and it was eventually concluded that the swap memory was not disabled, which is an important requirement of the kubelet. After disabling swap, the control plane initialized successfully for a few minutes before throwing a ”connection refused” error on any kubectl command. This time the actual error could not be located through the system logs, and after spending a considerable amount of time on troubleshooting, the problem was solved by modifying the containerd configuration file.

Pod Network Add-on

In order for the control plane to be able to communicate with pods on other nodes, a pod network add-on is required [58]. This routes the traffic between pods and allows them to send and receive traffic. Initially, Calico was tried, as it is one of the most popular pod network add-ons.
However, it was abandoned due to unnecessary functionality in favor of a much simpler pod network add-on called Flannel [28], which could easily be applied to the cluster with a single command on the control plane. Specifically, one can deploy Flannel by adding the flag --pod-network-cidr with the value 10.244.0.0/16 to the kubeadm init command and, after setting up the Kubernetes admin configuration file, running the following command according to the official Flannel documentation [16]:

kubectl apply -f \
https://github.com/flannel-io/flannel/releases/latest/download/kube-flannel.yml

Static Bootstrap Token

A growing concern during the time span of this thesis was: ”What happens to the cluster if a server fails?”. The answer is that the whole cluster has to be reset after the servers have been restored, which is a time-consuming process. One step in automating both the kubeadm init and kubeadm join processes was to specify a static bootstrap token as part of the authentication between the control plane and the workers. As described in Section 2.5.1, the token is typically a random string and part of the kubeadm join command, generated during kubeadm init. However, by adding the flag --token followed by the desired value and --token-ttl followed by 0 to the kubeadm init command, the token is set to the chosen value without expiration. On the worker node, the kubeadm join command must also include the --token flag followed by the specified token value, along with the --discovery-token-unsafe-skip-ca-verification flag. The latter allows a server to join the control plane without the need for the --discovery-token-ca-cert-hash flag. The full kubeadm init and kubeadm join commands are presented in Section 3.3.3.

3.3.3 Setting up a Cluster with kubeadm

To create the initial cluster, all nodes had to be virtual machines, as described in Figure 3.1 and Table 3.1, due to lack of resources.
Control Plane

The VM that hosts the control plane is named ”master”. To facilitate the first initialization of the control plane, a custom shell script was created which automatically downloads all required dependencies for Kubernetes, configures the containerd CRI, initializes the control plane and applies the Flannel pod network add-on according to the requirements in Section 3.3.2. The full kubeadm init command is:

sudo kubeadm init \
--pod-network-cidr 10.244.0.0/16 \
--token [a-z0-9]{6}.[a-z0-9]{16} \
--token-ttl 0

Worker Node

A second VM named ”worker” was created as a Kubernetes worker node. The worker node was easily set up by using a modification of the control plane setup script that only installs the required dependencies and joins the master node, without the need to add any pod network add-on. The complete kubeadm join command is:

sudo kubeadm join master:6443 \
--token [a-z0-9]{6}.[a-z0-9]{16} \
--discovery-token-unsafe-skip-ca-verification

At this point a minimal K8s cluster with one master node (control plane) and one worker node had rapidly been created and was successfully running. Afterwards, work on actual pipeline components could start.

Automating Cluster Restart

An important component of this thesis has been the Bash shell, due to its ability to reduce manual interaction with the terminal. Moreover, if a VM already has all dependencies installed and is ready to be a Kubernetes node, then a good practice after reboot is to run a modified script. Taking into account the requirements in Section 3.3.2, two shell scripts were created which revert any previous changes made by kubeadm init or kubeadm join and then proceed to reset the cluster. These scripts were configured to run at user login by adding their complete path as the last line of the file .bashrc.
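The setup scripts described in this section configure containerd before running kubeadm. A commonly cited containerd setting in this context, and a frequent fix for ”connection refused” symptoms like the one encountered in Section 3.3.2, is aligning the cgroup driver with the kubelet. The fragment below is a hedged sketch (assuming the version 2 configuration schema), not necessarily the exact change made in this work:

```toml
# Fragment of /etc/containerd/config.toml (version 2 schema assumed).
# SystemdCgroup = true makes containerd use the systemd cgroup driver,
# matching the kubelet's default; a mismatch commonly breaks kubeadm clusters.
version = 2

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true
```

A full default configuration can be regenerated with containerd config default before editing, after which containerd must be restarted for the change to take effect.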
3.3.4 Crafting the First Pipeline Component

The next step was to create ML training and testing micro-services that would be solely responsible for training an ML model and testing it on a specific dataset, utilizing the Kubernetes cluster. However, before this could be achieved, a suitable model with corresponding training and testing scripts had to be found, successfully run on local machines, containerized with Docker and finally integrated with the K8s cluster.

After some research, a GitHub repository containing YOLOv5 was cloned and modified to run locally [52]. With this repository used as a base, a custom dataset of forklifts and people served as training data, and the resulting trained model was then used to perform testing (detection) on a single mp4 video. Due to limited resources at the time, the number of epochs and the batch size had to be significantly decreased. Although this simplified the debugging process significantly, it yielded results of lower quality. However, since the prediction quality was considered out of the project scope, no further effort was put into improving it.

To containerize the training and testing applications and turn them into micro-services, two Dockerfiles were made using the official Python:3 image from Docker Hub as the base image. For each of the applications a shell script, only for use outside of K8s, was made that builds and runs the respective Docker image with the right commands. During the build stage the Ultralytics YOLOv5 GitHub repository is cloned and all the required Python packages are installed [18]. A local directory is then mounted into the container, where the retrained weights will be saved after training completion. For both training and detection the build stage involves copying the configuration file data.yaml into the container. This file specifies the location of the training and testing data as well as the object labels ”forklift” and ”person”.
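A sketch of what such a training Dockerfile could look like is given below. The repository URL, file paths and the reduced epoch and batch-size values are illustrative assumptions, not the exact files used in the thesis:

```dockerfile
# Hypothetical training image; paths and hyperparameters are placeholders.
FROM python:3

# Clone YOLOv5 and install its Python dependencies during the build stage
RUN git clone https://github.com/ultralytics/yolov5.git /yolov5 \
    && pip install --no-cache-dir -r /yolov5/requirements.txt

# data.yaml points at the custom forklift/person dataset
COPY data.yaml /yolov5/data.yaml
WORKDIR /yolov5

# Reduced epochs and batch size, mirroring the resource constraints described above
CMD ["python", "train.py", "--data", "data.yaml", "--epochs", "3", "--batch-size", "2"]
```

The corresponding detection Dockerfile would differ mainly in its CMD line, which matches the observation below that the two files are nearly identical.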
When the training application runs, YOLOv5 is trained on the custom dataset. Initially, when attempting to run the Docker container, the default run command output an error stating that the container could not run successfully due to limited shared memory. This error can be solved by modifying the run command that starts the container, adding the flag --shm-size=10g, which sets the shared memory space inside the container to 10 gigabytes, thus successfully retraining the YOLOv5 model as a micro-service.

After training, the resulting weights file best.pt is saved, copied into the mounted directory and later used to detect forklifts and people during testing. The testing is performed on each frame of a single video file, the output of which is eventually stored in the mounted directory as well. The Dockerfile and the shell script for testing are quite similar to the respective files for training, with only a few changes, which demonstrates time efficiency and cohesiveness across this ML project. Containerizing these ML applications successfully was an important step towards automation. Before integration into the Kubernetes cluster, however, a data storage directory needed to be created, along with a container registry from which the cluster pulls the custom training image for pod creation.

3.3.5 Creating External Storage

To create an external file storage for the K8s cluster, a VM named ”storage” was created according to the description in Section 3.3.1. After installing a Samba server and creating a small configuration file specifying access privileges, the VM was able to share a directory locally on the network. This shared directory had to be mounted on the worker node, which posed an issue since the IP address of the storage VM was dynamically allocated during the beginning of this thesis.
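Once the shared directory is mounted on a node, it can be exposed to the cluster through a PersistentVolume and claimed by pods with a PersistentVolumeClaim. The following is a hedged sketch of such a pair; the names, the hostPath location and the access mode are assumptions, while the 10-gigabyte capacity and 3-gigabyte claim follow the sizes described in this chapter:

```yaml
# Sketch only: names, path and access mode are placeholders.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: samba-pv
spec:
  capacity:
    storage: 10Gi          # the 10 GB exposed from the Samba share
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /mnt/shared      # assumed mount point of the share on the worker node
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: samba-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 3Gi         # the 3 GB portion claimed for the training job
```

Applying these files on the control plane binds the claim to the volume, after which pods can mount the claim by name.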
To solve this issue, a custom mounting script was made which used nbtscan to resolve the IP address from the Network Basic Input/Output System (NetBIOS) name of the storage VM. With the shared directory mounted on the worker, the external file storage could be abstracted into the K8s cluster. This was done by applying a PersistentVolume type configuration file to the control plane that made 10 gigabytes of storage from the shared Samba directory accessible to the K8s cluster.

3.3.6 Configuring Container Registry

The registry VM hosts the private container registry locally. The configuration installs docker-registry from docker.io on the VM and authenticates with a specified username and password. Moreover, the registry should be set as insecure in the Docker daemon on each relevant node, including the registry, and the registry port enabled in the firewall.

A major obstacle that slowed down the progress of this thesis was dynamic IP addresses. This became a problem since the authors did not have full network privileges. Both the Docker daemon configuration and the containerd configuration need to include the IP address of the registry, which in turn introduces severe inconveniences when the IP changes. Due to this, a decision was made to set up a separate router for this project and configure static IP addresses for all the VMs, as described in Section 3.3.1. Afterwards, building and pushing the YOLOv5 training and testing images to the local Docker registry is possible from any node connected to the local network and logged in to the registry. For the K8s cluster to access the images through the containerd CRI, a cluster secret containing the local registry credentials has to be added to the cluster. The secret is added by applying a yaml configuration file on the control plane. From the secret, both the username and password can be extracted as a base64 string and added to the containerd configuration file.
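A sketch of such a registry-credentials secret is shown below. The secret name is a placeholder, and the payload would be a base64-encoded Docker config JSON containing the registry address, username and password (typically generated with kubectl create secret docker-registry rather than written by hand):

```yaml
# Hypothetical registry secret; name and payload are placeholders.
apiVersion: v1
kind: Secret
metadata:
  name: regcred
type: kubernetes.io/dockerconfigjson
data:
  # base64 of a Docker config.json holding the registry URL and credentials
  .dockerconfigjson: <base64-encoded Docker config>
```

Workloads can then reference the secret by name through the imagePullSecrets field of their pod specification when pulling images from the private registry.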
3.3.7 Running the First Pipeline Components in a Kubernetes Cluster

With the cluster able to access an external storage and an external container registry, the YOLOv5 training and testing micro-services can be integrated into the cluster. This is achieved by first creating a PersistentVolumeClaim, which claims a portion (3 gigabytes) of the PersistentVolume described in Section 3.3.5, and a Job. These are separate configuration files that are applied to the cluster through the control plane.

In the training Job file it is necessary to specify a volume mount to a shared memory directory inside the container. This ensures that the container runs with more shared memory and thus without getting killed. The detection Job is almost identical to the training Job. The main difference is calling another function from the same YOLOv5 repository during execution. This function uses the mp4 video located in the PersistentVolume, accessed with the same PersistentVolumeClaim as the training Job. When this Job is run, it attempts to detect forklifts and outputs the result in the pod logs.

3.3.8 Kubernetes Dashboard

To facilitate monitoring the cluster, the official Kubernetes dashboard can be installed by applying configuration files provided by the Kubernetes team. This creates a new cluster namespace with two pods running in it. At this point an error can result in the pods never becoming ready. Troubleshooting this error involves checking the pod logs with kubectl. To resolve the issue, try removing a container network interface of type "bridge" on the hosting node. The K8s service named dashboard-service has to be edited to be of type NodePort rather than ClusterIP. This enables access to the dashboard from a browser on the local network by typing in the IP address of the master node.
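The PersistentVolumeClaim and training Job described in Section 3.3.7 can be sketched roughly as follows. All names, the registry address and the resource sizes are illustrative; the shared memory mount uses an in-memory emptyDir, a common way to enlarge /dev/shm in a pod:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-pvc            # hypothetical name
spec:
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: 3Gi              # claims part of the 10 Gi PersistentVolume
---
apiVersion: batch/v1
kind: Job
metadata:
  name: yolov5-train            # hypothetical name
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: train
          image: 192.168.1.10:5000/yolov5-train:latest   # hypothetical registry address
          volumeMounts:
            - name: storage
              mountPath: /usr/src/app/output   # artifacts land in the mounted storage
            - name: dshm
              mountPath: /dev/shm              # enlarged shared memory
      volumes:
        - name: storage
          persistentVolumeClaim:
            claimName: training-pvc
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 10Gi
      imagePullSecrets:
        - name: registry-secret              # hypothetical Secret name
```

Both files are applied with kubectl apply -f on the control plane; the detection Job differs only in the command executed and reuses the same claim.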
To be able to log in via the dashboard, a service account, a cluster role binding and a secret have to be created by applying three configuration files on the cluster. After this, a token is generated with a single command, which is used to log in on the dashboard [50].

3.3.9 Distributed Training in a Kubernetes Cluster

To introduce further functionality to the K8s cluster, a second worker node can be created to enable DDP training. To be effective, both the physical server where the training is initiated (the client) and the cluster need to be configured accordingly. By following the method below, the client does not need to be a node in the K8s cluster; rather, it can be any physical server.

On the cluster side, Volcano [77] is installed by applying configuration files provided by the Volcano team. On the client side, a regular Python virtual environment is set up with a small project containing some example code. To be able to use TorchX to schedule jobs on Kubernetes, the Kubernetes Python client package and TorchX must be installed with pip beforehand. In order for TorchX to find the K8s cluster and interact with it, it is necessary to copy the kube configuration file from the master node, generated during kubeadm init, and place it in a new directory ~/.kube/ on the client. This is the default directory where TorchX expects to find the configuration file.

The TorchX launcher is a tool that can be used to schedule DDP training on the cluster. It uses a default Docker image and pulls it into the current client workspace. The image is updated with the existing components of the workspace, such as pip packages and Python code, and then pushed to a specified registry. Before running TorchX, a configuration file has to be made that specifies the Volcano queue and the image repository to upload the modified image to. The image is then built and pushed to the specified container registry, in this case Docker Hub.
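The TorchX configuration file (.torchxconfig, placed in the project root) can be sketched as follows. The section and key names follow the TorchX Kubernetes scheduler documentation; the queue name and repository are placeholders:

```
# .torchxconfig (sketch; values are placeholders)
[kubernetes]
queue = default
image_repo = dockerhub-user/torchx-workspace
```

A script can then be launched with a command of roughly the form `torchx run -s kubernetes dist.ddp -j 2x1 --script main.py`, where `-s kubernetes` selects the scheduler and `-j 2x1` requests two replicas with one process each (command shape hedged from the TorchX documentation; main.py is a placeholder).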
After logging in to the registry with docker login, example scripts provided by TorchX [42] and PyTorch Lightning [43] can be executed on the K8s cluster using a single TorchX run command, with Kubernetes specified as the scheduler argument and the Python script as a regular argument. Since the job is launched by TorchX and scheduled on Kubernetes via Volcano, the pods in the cluster pull the edited image, containing the workspace, and execute the distributed training. PyTorch Lightning manages the way training is distributed among processes.

Chapter 4

Results

This chapter articulates the results obtained by following everything that was discussed in Methods, Chapter 3. This mainly includes the artifacts obtained by the CML pipeline and the architecture of the final pipeline using Kubernetes.

4.1 Simple CI Pipeline on GitLab using CML

In this section, the results of the simple CML model are presented. As described in Section 3.2, upon triggering, the Continuous Integration (CI) pipeline generates job logs alongside two artifacts. Figure 4.1 shows a successful output of the pipeline logs on GitLab, completed in 36 seconds.

Figure 4.1: CI pipeline logs on GitLab using the simple CML model.

The first artifact is a .txt file comprising the model accuracy, the contents of which are shown below, and the second artifact is the confusion matrix of the model, depicted in Figure 4.2. As expected, the short CI pipeline is rapidly executed and the artifacts prove the model's high accuracy.

Accuracy: 0.864

Figure 4.2: Confusion matrix depicting the performance of the simple CML model, saved as a CI artifact.

4.2 The Finalized Pipeline

The pipeline is supported by a K8s cluster with a storage mounted to the cluster and access to a container registry. Each function in the pipeline is seen as a micro-service and realized in the cluster as an application running inside a container.
The final K8s cluster consists of a master node (control plane) and two worker nodes that all run on separate virtual machines hosted by VirtualBox. All the pods that comprise the pipeline are exclusively run on the worker nodes. Each pod uses only cluster memory, without storing any data locally, unless specified to store an artifact in the mounted storage. Specifically, the cluster memory is a shared folder on another virtual machine. This external memory is mounted into the cluster and internally accessed through a PersistentVolume configuration. Each pod then accesses parts of the mounted storage through PersistentVolumeClaims.

The Docker images running in the pods are stored in a container registry. Unlike the storage, the container registry is not mounted to the cluster. Instead, Docker images are directly pushed to and pulled from it as long as the pods have access to its IP address. Pods authenticate to the container registry with a K8s Secret, applied through a configuration file. Because the local registry kept crashing, Docker Hub is used instead to ensure a seamless process.

Figure 4.3: Parts of the pipeline used for object detection.

4.2.1 Object Detection Training and Testing

There are two micro-services running in the cluster: one that trains the YOLOv5 model on a custom dataset and one that performs detection on a video using the result from the training micro-service, as described in Section 3.3.4. Both are based on the same code repository, which is downloaded into the Docker container at startup. All training and testing scripts are included in the repository. Once the model has been trained on the custom dataset, the resulting weights are stored in the shared folder and later accessed by the detection container. The full structure of the active parts of the K8s cluster is depicted in Figure 4.3.
4.2.2 Distributed Data Parallel Training

As described in Section 3.3.9, distributed training is achieved using PyTorch Lightning and then launched with TorchX. By storing the kube configuration file locally and downloading the K8s Python package, TorchX can automatically schedule the training on the K8s cluster after Volcano has been installed. TorchX handles the scheduling while K8s ensures that processes are distributed across the resources in the cluster. This combination has proven very time and resource efficient, without demanding the complex implementation characteristic of DDP training. The structure of the pipeline used for DDP training is illustrated in Figure 4.4. Figures 4.5 and 4.6 show the difference in training time when training on one node compared to two.

Figure 4.4: Parts of the pipeline used for DDP training.

Figure 4.5: Training on the MNIST data set took almost an hour with one worker.

Figure 4.6: Training on two workers significantly reduced the training time to approximately 3 minutes.

Chapter 5

Discussion

This chapter provides a critical overview of this thesis work, including factors that inhibited its progress, as well as ideas for possible improvements. It compares the perceived differences between DevOps and MLOps and discusses how certain tactics affected the overall outcome of this thesis. Moreover, this chapter proposes software alternatives and solutions to issues regarding the advancement of the final pipeline.

Initially, the focus of this thesis was on creating CI/CD pipeline components for MLOps. However, the interest shifted to building the pipeline infrastructure using DevOps. As the authors of this thesis had no previous experience with DevOps, MLOps or container orchestration, creating and configuring a functioning on-premises Kubernetes cluster from scratch proved to be quite a challenge. In addition, the time constraints and lack of Kubernetes expert advice certainly limited the outcome.
Despite this, a fully functional K8s cluster with some sample features was eventually created with relatively few and inexpensive resources. The authors believe that this project can serve as an introduction to basic container orchestration with Kubernetes. It also has a lot of potential for improvement: further work could include adding more nodes and replacing the virtual machines with physical machines to achieve the full intended potential, as well as adding more features to the cluster. Using better storage, such as a storage cluster instead of a local directory, is also recommended. With a real on-premises K8s cluster the owner has full control and can use it with sensitive data that is not suitable for third-party providers.

Having argued that Kubernetes is so difficult to set up, the reader might wonder: why not just use a tool with much simpler built-in CI/CD, such as GitLab? While this is certainly sufficient for many use cases, it has limited functionality when working with a lot of data and traffic. Moreover, creating an internally licensed project that involves high-level customisation with open source software is indeed demanding. The authors realized early on that with Kubernetes, there are almost no limits to scalability. This is especially true when Kubernetes is built on bare metal. Many DevOps developers and engineers argue that a Kubernetes cluster on physical machines can prove to be a powerful tool and that frameworks offering automated cluster configuration lack customisation. Kubernetes can also be used beyond building CI/CD pipelines, since container orchestration is inherently a broad practice. On the other hand, they also warn that experience is usually a prerequisite for achieving potent customisation. Most enterprises prefer cloud providers because they offer accessibility, easier configuration and abundant resources.
The most popular cloud offerings for Kubernetes are Amazon Web Services (AWS) and Google Kubernetes Engine (GKE), which require a paid subscription. Thus, they are deemed unsuitable for this thesis.

Another interesting point of discussion is the choice of Kubernetes in the context of ML pipelines instead of newer, established open source MLOps tools such as Kubeflow, MLflow or Airflow. These tools were investigated to determine their value for this thesis work. Although they abstract away the complexity of managing ML workflows with a plethora of pre-integrated ML components and pipeline management, they lack flexibility and maturity. They are optimized with a sole focus on developing and engineering ML, which in turn restricts users to code notebooks. And, although they are efficient for their purpose, they are not deemed sufficient thesis material. These are the main reasons why the simple CML pipeline has not been further developed and adapted. On the other hand, Kubernetes is a well-documented and long-established software tool adopted by several industries worldwide, which also allows precise control over any infrastructure. Therefore, knowledge and experience with Kubernetes is considered a worthwhile investment, despite the difficulties.

To make this result resemble an end-to-end MLOps pipeline, more steps such as data collection, data cleaning, feature engineering and model deployment should be added. Implementing a web API was considered and would have been used to interact with a deployed model through a web browser. The idea was to turn the containerized micro-services into applications that could be deployed with a web API. As a candidate, FastAPI [46] was mainly considered among other software solutions. This part was not implemented due to lack of time. Instead, the focus was shifted to testing and refining already existing work.
As already mentioned above, an important improvement is considering a more suitable storage option for the cluster. Using a local folder as external storage is not considered appropriate for a production cluster. Instead, the authors recommend exploring Ceph [78], an open source distributed storage system with scaling capability. It requires more hardware to maintain the storage cluster, as well as more effort to set up and maintain, but provides reliable storage as a result. If one does not need to store any sensitive data and wants to avoid setting up and maintaining their own storage cluster, the authors of this thesis recommend using cloud storage providers. Although these services can sometimes be expensive and are provided by a third party, they are very convenient because the user never has to worry about administration or maintenance.

In this project, the Flannel pod network add-on was chosen because it is relatively small and simple. However, in more advanced clusters that require network policies, other add-ons should be used, for instance Calico.

SMILE IV Project Contribution

In collaboration with Infotiv Technology Development, this thesis project is currently integrated into the SMILE IV project. With the authors' assistance, Infotiv Technology Development launched a simulation on the Kubernetes cluster as a Deployment and created pods that can collect data from any specific camera mounted in a virtual warehouse with forklifts, boxes and people. Appendix A provides access to the code repository for the SMILE-IV project.

For the SMILE IV project, a Kubernetes cluster can be a valuable solution. In particular, the simulation is CPU intensive because it is based on a model that processes many frames per second for multiple cameras. In the next phases of the project, the goal is to collect information from processed frames, perceive the environment, and assist agents in navigating to specified locations.
Therefore, this process can be distributed across many physical machines hosting a Kubernetes cluster.

Chapter 6

Conclusion

We began this thesis with the hope of exploring and researching different implementations of MLOps. Nevertheless, our interest quickly shifted towards the more established DevOps, in particular Docker and Kubernetes, and utilizing them to build a pipeline with ML code. Thus, we discovered the benefits and difficulties that follow a more sophisticated setup, without losing the ML element. Setting up a K8s cluster from scratch requires time and effort, especially for young developers and engineers. It is not analogous to downloading and installing a single software application for immediate use. A K8s cluster should ideally have one or several administrators who are responsible for setting up the cluster, maintaining it and adding resources to it, such as nodes and persistent volumes. The developers who create the applications are then responsible for integrating their applications into the cluster by writing configuration files. As an alternative to setting up an on-premises cluster, there are cloud providers that offer ready-to-use solutions and handle all the administration. These can, however, be expensive and do not come with full control over the cluster.

To conclude, DevOps tools are immensely powerful and popular, and are not limited to ML, although they support it impeccably. On the other hand, MLOps tools should be used for smaller ML projects and should most definitely be combined with established DevOps. In this sense, both DevOps and MLOps make the future of ML look structured and cost-effective.

Bibliography

[1] Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.
S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[2] Apache License, Version 2.0. https://www.apache.org/licenses/LICENSE-2.0.txt. Accessed: Tuesday 11th June, 2024.
[3] armosec. What is etcd? https://www.armosec.io/glossary/etcd-kubernetes/. Accessed: Tuesday 11th June, 2024.
[4] armosec. What is Kubelet? https://www.armosec.io/glossary/kubelet/. Accessed: Tuesday 11th June, 2024.
[5] Ashish Patel. Kubernetes — Storage Overview — PV, PVC and Storage Class. https://medium.com/devops-mojo/kubernetes-storage-options-overview-persistent-volumes-pv-claims-pvc-and-storageclass-sc-k8s-storage-df71ca0fccc3. Accessed: Tuesday 11th June, 2024.
[6] atlassian. What Is DevOps? https://www.atlassian.com/devops.
[7] avinetworks. Service Discovery Definition. https://avinetworks.com/glossary/service-discovery/.
[8] Carter, G., Ts, J., and Eckstein, R. Using Samba, 3rd Edition. O'Reilly Media, Inc., 2007.
[9] Chacon, S., and Straub, B. Pro git. Apress, 2014.
[10] Conscia. DevOps. https://conscia.com/se/losningar/digital-transformation/devops/.
[11] Containerd Authors. Containerd. https://github.com/containerd/containerd/tree/main. Accessed: Tuesday 11th June, 2024.
[12] Docker, Inc. Docker Hub. https://hub.docker.com/.
[13] Docker, Inc. Docker, March 20 2013. https://www.docker.com/.
[14] DVC.ai. CML - Continuous Machine Learning. https://cml.dev/. Accessed: Tuesday 11th June, 2024.
[15] Falcon, William and The PyTorch Lightning team. PyTorch Lightning. https://www.pytorchlightning.ai, Repository Code: https://github.com/Lightning-AI/lightning.
[16] flannel-io. Flannel. https://github.com/flannel-io/flannel. Accessed: Tuesday 11th June, 2024.
[17] Gift, N., and Deza, A. Practical MLOps. O'Reilly Media, Inc., 2021.
[18] Glenn Jocher. Ultralytics YOLOv5, 2020. https://github.com/ultralytics/yolov5.
[19] Google. What is container orchestration? https://cloud.google.com/discover/what-is-container-orchestration.
[20] Google Cloud. MLOps: Continuous Delivery and Automation Pipelines in Machine Learning, 2023. https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning.
[21] Harinderjit Singh. Inspecting and Understanding Kubernetes (k8s) Service Network. https://itnext.io/inspecting-and-understanding-service-network-dfd8c16ff2c5. Accessed: Tuesday 11th June, 2024.
[22] Holder, I. Kubernetes 1.28: Beta support for using swap on linux. https://kubernetes.io/blog/2023/08/24/swap-linux-beta/, August 2023.
[23] Hugging Face. https://huggingface.co/. Accessed: Tuesday 11th June, 2024.
[24] Iterative. example cml. https://github.com/iterative/example_cml. Commit: 56de3ef.
[25] Kaggle. https://www.kaggle.com/. Accessed: Tuesday 11th June, 2024.
[26] Karamitsos, I., Albarhami, S., and Apostolopoulos, C. Applying devops practices of continuous automation for machine learning. Information 11, 7 (2020).
[27] KodeKloud. Kube-Proxy: What Is It and How It Works. https://kodekloud.com/blog/kube-proxy/.
[28] Koukis, G., Skaperas, S., Kapetanidou, I. A., Mamatas, L., and Tsaoussidis, V. Performance evaluation of kubernetes networking approaches across constraint edge environments, 2024.
[29] Kreuzberger, D., Kühl, N., and Hirschl, S. Machine learning operations (mlops): Overview, definition, and architecture. IEEE Access 11 (2023), 31866–31879.
[30] Li, S., Zhao, Y., Varma, R., Salpekar, O., Noordhuis, P., Li, T., Paszke, A., Smith, J., Vaughan, B., Damania, P., and Chintala, S. Pytorch distributed: Experiences on accelerating data parallel training, 2020.
[31] Linux.com Editorial Staff, and Gary Sims. All about linux swap space. Linux.com (September 2007). Accessed: Tuesday 11th June, 2024.
[32] Merkel, D. Docker: lightweight linux containers for consistent development and deployment. Linux Journal 2014 (03 2014).
[33] Miona Aleksic. Containerization vs. Virtualization: understand the differences. https://ubuntu.com/blog/containerization-vs-virtualization. Accessed: Tuesday 11th June, 2024.
[34] The MIT License (MIT). https://spdx.org/licenses/MIT.html. Accessed: Tuesday 11th June, 2024.
[35] Neal Analytics. MLOps. https://nealanalytics.com/expertise/mlops/. Accessed: Tuesday 11th June, 2024.
[36] Nicolas Ehrman. Container Runtimes Explained, 2024. https://www.wiz.io/academy/container-runtimes.
[37] Oracle Corporation. VirtualBox, January 17 2007. https://www.virtualbox.org/.
[38] Overflow, S. Developer Sentiment around AI/ML. https://stackoverflow.co/labs/developer-sentiment-ai-ml/. Accessed: Tuesday 11th June, 2024.
[39] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Köpf, A., Yang, E. Z., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. Pytorch: An imperative style, high-performance deep learning library. CoRR abs/1912.01703 (2019).
[40] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
[41] Pierre Leveau. How to Perform Distributed Training? https://kili-technology.com/data-labeling/machine-learning/how-to-perform-distributed-training.
[42] PyTorch. TorchX. https://github.com/pytorch/torchx. Accessed: Tuesday 11th June, 2024.
[43] PyTorch. TorchX. https://github.com/pytorch/torchx. Accessed: Tuesday 11th June, 2024.
[44] Quobyte. What is Kubernetes and How It Works. https://www.quobyte.com/storage-explained/what-is-kubernetes/. Accessed: Tuesday 11th June, 2024.
[45] Raft Team. The Raft Consensus Algorithm. https://raft.github.io/.
[46] Ramírez, S. Fastapi, 2020. https://fastapi.tiangolo.com, code repository: https://github.com/tiangolo/fastapi.
[47] Recupito, G., Pecorelli, F., Catolino, G., Moreschini, S., Nucci, D. D., Palomba, F., and Tamburri, D. A. A multivocal literature review of mlops tools and features. In 2022 48th Euromicro Conference on Software Engineering and Advanced Applications (SEAA) (2022), pp. 84–91.
[48] Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. You only look once: Unified, real-time object detection, 2016.
[49] Regehr, J. Race condition vs. data race. https://blog.regehr.org/archives/490, March 13 2011. Embedded in Academia.
[50] Sahid. Kubernetes dashboard: An overview, installation, and accessing. https://k21academy.com/docker-kubernetes/kubernetes-dashboard/, December 8 2023.
[51] Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., Young, M., Crespo, J.-F., and Dennison, D. Hidden technical debt in machine learning systems. In Advances in Neural Information Processing Systems (2015), C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, Eds., vol. 28, Curran Associates, Inc.
[52] SelimSavas. forklift-and-people-detection-with-YOLOv5. https://github.com/SelimSavas/forklift-and-people-detection-with-YOLOv5. Accessed: Tuesday 11th June, 2024.
[53] Shingai Zivuku. In-Depth Analysis of Kubernetes Scheduling. https://zivukushingai.medium.com/in-depth-analysis-of-kubernetes-scheduling-abf08f949924.
[54] Stack Overflow. Stack Overflow Annual Developer Survey (2023). https://survey.stackoverflow.co/2023/. Accessed: Tuesday 11th June, 2024.
[55] Szeliski, R. Computer Vision: Algorithms and Applications, 2nd ed. Springer, 2022. Final draft, September 30, 2021.
[56] The Kubernetes Authors. Cloud Controller Manager. https://kubernetes.io/docs/concepts/architecture/cloud-controller/. Accessed: Tuesday 11th June, 2024.
[57] The Kubernetes Authors. Command line tool (kubectl). https://kubernetes.io/docs/reference/kubectl/. Accessed: Tuesday 11th June, 2024.
[58] The Kubernetes Authors. Creating a cluster with kubeadm. https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/create-cluster-kubeadm/. Accessed: Tuesday 11th June, 2024.
[59] The Kubernetes Authors. Deploy and Access the Kubernetes Dashboard. https://kubernetes.io/docs/tasks/access-application-cluster/web-ui-dashboard/. Accessed: Tuesday 11th June, 2024.
[60] The Kubernetes Authors. Installing kubeadm. https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/install-kubeadm/. Accessed: Tuesday 11th June, 2024.
[61] The Kubernetes Authors. kube-scheduler. https://kubernetes.io/docs/reference/command-line-tools-reference/kube-scheduler/. Accessed: Tuesday 11th June, 2024.
[62] The Kubernetes Authors. Kubernetes. https://kubernetes.io/. Accessed: Tuesday 11th June, 2024.
[63] The Kubernetes Authors. Kubernetes Components. https://kubernetes.io/docs/concepts/overview/components/. Accessed: Tuesday 11th June, 2024.
[64] The Kubernetes Authors. Migrating from dockershim. https://kubernetes.io/docs/tasks/administer-cluster/migrating-from-dockershim/. Accessed: Tuesday 11th June, 2024.
[65] The Kubernetes Authors. Pods. https://kubernetes.io/docs/concepts/workloads/pods/. Accessed: Tuesday 11th June, 2024.
[66] The Kubernetes Authors. Pull an Image from a Private Registry. https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/. Accessed: Tuesday 11th June, 2024.
[67] The Kubernetes Authors. Resource Management for Pods and Containers. https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#resource-units-in-kubernetes. Accessed: Tuesday 11th June, 2024.
[68] The Kubernetes Authors. The Kubernetes API. https://kubernetes.io/docs/concepts/overview/kubernetes-api/. Accessed: Tuesday 11th June, 2024.
[69] The Kubernetes Authors. Viewing Pods and Nodes. https://kubernetes.io/docs/tutorials/kubernetes-basics/explore/explore-intro/. Accessed: Tuesday 11th June, 2024.
[70] The Kubernetes Authors. Persistent Volumes. https://kubernetes.io/docs/concepts/storage/persistent-volumes/. Accessed: Tuesday 11th June, 2024.
[71] The Samba Team. SAMBA. https://www.samba.org/.
[72] Tigera. Calico. https://github.com/projectcalico/calico. Accessed: Tuesday 11th June, 2024.
[73] VICTOR HERNANDO. How to Monitor kube-controller-manager. https://sysdig.com/blog/how-to-monitor-kube-controller-manager/. Accessed: Tuesday 11th June, 2024.
[74] Vijayakumar, A., and Vairavasundaram, S. Yolo-based object detection models: A