Exploring Optimized CPU-Inference for Latency-Critical Machine Learning Tasks An evaluation of CPUs as an alternative hardware for real-time computer vision applications by using model compression Master’s thesis in Complex Adaptive Systems MAX SEDERSTEN AMANDA SIKLUND DEPARTMENT OF PHYSICS CHALMERS UNIVERSITY OF TECHNOLOGY Gothenburg, Sweden 2024 www.chalmers.se www.chalmers.se Master’s thesis 2024 Exploring Optimized CPU-Inference for Latency-Critical Machine Learning Tasks An evaluation of CPUs as an alternative hardware for real-time computer vision applications by using model compression MAX SEDERSTEN AMANDA SIKLUND Department of Physics Chalmers University of Technology Gothenburg, Sweden 2024 Exploring Optimized CPU-Inference for Latency-Critical Machine Learning Tasks An evaluation of CPUs as an alternative hardware for real-time computer vision applications by using model compression MAX SEDERSTEN AMANDA SIKLUND © MAX SEDERSTEN, AMANDA SIKLUND, 2024. Supervisor: Filip Wikman, Tenfifty Examiner: Mats Granath, Department of Physics Master’s Thesis 2024 Department of Physics Chalmers University of Technology SE-412 96 Gothenburg Telephone +46 31 772 1000 Typeset in LATEX, template by Kyriaki Antoniadou-Plytaria Printed by Chalmers Reproservice Gothenburg, Sweden 2024 iv Exploring Optimized CPU-Inference for Latency-Critical Machine Learning Tasks An evaluation of CPUs as an alternative hardware for real-time computer vision applications by using model compression MAX SEDERSTEN, AMANDA SIKLUND Department of Physics Chalmers University of Technology Abstract In recent years, machine learning has grown to become increasingly prevalent for a wide range of applications spanning multiple industries. For some of these applica- tions, low latency can be critical, which may limit the types of hardware that can be used. Graphical Processing Units (GPUs) have long been the go-to hardware for machine learning tasks, often outperforming alternatives like Central Process- ing Units (CPUs), but these are not practical in all situations. We explore CPUs, leveraging modern optimization techniques like pruning and quantization, as a com- petitive alternative to GPUs with comparable predictive performance. This thesis provides a comparison of the two hardware types on a real-time latency-critical vi- sion task. On the GPU side, TensorRT in combination with quantization is used to achieve state-of-the-art inference performance on the hardware. On the CPU side, the model is optimized using SparseML to introduce unstructured sparsity and quantization. This optimized model is then used by the DeepSparse runtime engine for optimized inference. Our findings show that the CPU approach can outperform the GPU hardware in certain situations. This suggests that CPU hardware could potentially be used in applications previously limited to GPUs. Keywords: machine learning, neural network, model compression, pruning, quanti- zation, optimization, CPU, GPU, Neural Magic, NVIDIA v Acknowledgements We would like to thank our supervisor at Tenfifty, Filip Wikman, for his guidance and support throughout this thesis. His expertise and insightful contributions have been a valuable part of shaping the direction and outcomes of our work. We would also like to thank our supervisor and examiner at Chalmers, Mats Granath, for his valuable feedback and assistance, particularly in providing insightful guidance that helped us shape the project outline. 
Max Sedersten and Amanda Siklund, Gothenburg, June 2024 vii List of Acronyms Below is the list of acronyms that have been used throughout this thesis listed in alphabetical order: AI Artificial Intelligence ASIC Application-Specific Integrated Circuit AP Average Precision CNN Convolutional Neural Network CPU Central Processing Unit GPU Graphical Processing Unit IoU Intersection over Union mAP mean Average Precision OBD Optimal Brain Damage OBS Optimal Brain Surgeon OKS Object Keypoint Similarity ONNX Open Neural Network Exchange PTQ Post Training Quantization QAT Quantization Aware Training ReLU Rectified Linear Unit TPU Tensor Processing Unit YOLO You Only Look Once ix Contents List of Acronyms ix List of Figures xiii List of Tables xv 1 Introduction 1 1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Gap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3 Aim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.4 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.5 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2 Theory 7 2.1 Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . 7 2.2 Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2.1 Unstructured pruning . . . . . . . . . . . . . . . . . . . . . . . 9 2.2.2 Structured pruning . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2.3 Pruning-related fine-tuning . . . . . . . . . . . . . . . . . . . . 10 2.2.4 Post-training pruning . . . . . . . . . . . . . . . . . . . . . . . 11 2.3 Quantization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3.1 Post Training Quantization (PTQ) . . . . . . . . . . . . . . . 12 2.3.2 Quantization Aware Training (QAT) . . . . . . . . . . . . . . 12 2.4 SparseML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.5 DeepSparse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.6 TensorRT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.7 ONNX Runtime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.8 Pose Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.9 YOLOv8 Pose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3 Methods 21 3.1 CPU evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.1.1 Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.1.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.1.3 Sparsification recipes selection . . . . . . . . . . . . . . . . . . 23 xi Contents 3.2 GPU evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.2.1 Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.2.2 Model conversion process . . . . . . . . . . . . . . . . . . . . . 25 3.3 Validation setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 4 Results 27 4.1 Predictive performance . . . . . . . . . . . . . . . . . . . . . . . . . . 27 4.2 Inference time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 4.3 Combined evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 5 Discussion 35 5.1 Method and background . . . . . . . . . . . . . . . . . . . . . . . . . 35 5.2 Key findings analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 5.3 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
37 5.4 Ethical considerations . . . . . . . . . . . . . . . . . . . . . . . . . . 38 5.5 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 6 Conclusion 39 References 44 xii List of Figures 2.1 Visualization of a 2D convolution, including the input matrix (left), the convolutional kernel (center), and the resulting output matrix (right). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2 Illustration of structured and unstructured pruning. Left side shows the removal of structured groups of weights, while the right side shows the removal of weights in a unstructured manner. . . . . . . . . . . . 9 2.3 Visualization of zero-point quantization, illustrating the mapping of a floating-point distribution onto a lower-precision representation. Z represents the zero-point value. . . . . . . . . . . . . . . . . . . . . . 12 2.4 Illustrated example of a convolutional layer with QAT integration. . . 13 2.5 Flowchart of the model optimization process when using SparseML and DeepSparse. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.6 Visualization of how DeepSparse utilizes sparsity on CPU hardware. On the left, traditional execution on CPU is shown, while on the right, the sparse model is decomposed into column tensors that can run independently on individual cores. . . . . . . . . . . . . . . . . . 15 2.7 Visualization of IoU, being the ratio between the intersection area and the union area between two bounding box instances. . . . . . . . 16 2.8 Visualization of the similarity score distribution of two different point types. Similarity score above 0.5 is represented by the red inner ring. Original image by Franki Chamaki on Unsplash [37]. . . . . . . . . . 17 2.9 Example of the output predicted by the YOLOv8s pose model, in- cluding detected keypoints, bounding boxes, and the associated con- fidence scores. Original image by John Doe on Unsplash [40]. . . . . . 19 4.1 Comparison of precision for both pose and box predictions between the optimized models and the base case. . . . . . . . . . . . . . . . . 28 4.2 Comparison of recall for both pose and box predictions between the optimized models and the base case. . . . . . . . . . . . . . . . . . . 28 4.3 Comparison of mAP at an IoU/OKS threshold of 50% for pose and box predictions between the optimized models and the base case. . . 29 4.4 Comparison of mAP across IoU/OKS thresholds ranging from 50% to 95% for both pose and box predictions between the optimized models and the base case. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 xiii List of Figures 4.5 Comparison of the models’ inference times across different hardware and core configurations. The SparseML models are evaluated on an AMD Genoa with 8-, 15-, and 30-core configurations, while the in- ference times for TensorRT and ONNX Runtime are evaluated on a NVIDIA Jetson AGX Orin. . . . . . . . . . . . . . . . . . . . . . . . 31 4.6 Results of pose mAP over inference time for at an OKS threshold of 50%. The SparseML models are evaluated on an AMD Genoa 30-core processor, while the TensorRT models are evaluated on a NVIDIA Jetson AGX Orin. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 4.7 Results of pose mAP over inference time at an OKS threshold of 50% across all hardware configurations. This includes the TensorRT models, the quantized SparseML models, as well as the pruned and quantized SparseML models. . . . . . . . . . . . . . . . . . . . . . . . 
33 xiv List of Tables 3.1 Overview of the models trained with SparseML. The models are de- noted with "sml" to indicate their use of SparseML, "p..." to define their sparsity level, and "int8" to specify quantization. The recipes for these models are sourced from SparseZoo, with an "m" denoting any modifications. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.2 Overview of models converted to TensorRT and ONNX formats, in- dicated by "trt" and "onnx" respectively, followed by the numeric pre- cision.¢ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 4.1 Overview of the models’ predictive performance, including precision (P), recall (R), and mAP for both box and pose predictions. . . . . . 30 4.2 Overview of the inference times in milliseconds across various hard- ware configurations, including AMD Genoa with 8-, 15-, and 30-core configurations, as well as NVIDIA Jetson AGX Orin (64GB). . . . . . 32 xv List of Tables xvi 1 Introduction Today, machine learning is a rapidly evolving field with uses in a wide range of industries. The recent introduction of ChatGPT by OpenAI last year has catapulted the term "AI" into the spotlight, resulting in a lot of buzz around the topic [1]. Companies have also started to notice, with many committing large amounts of capital to their own AI research, so much so that the main supplier of AI hardware, NVIDIA, has hit an evaluation of over 2 trillion dollars [2]. NVIDIA’s position in the market is no coincidence. The company’s primary focus has for a long time been Graphical Processing Units (also known as GPUs), which is hardware specifically designed for graphically intensive workloads. These types of computational workloads are often heavily parallelizable in nature, a fact that is also true for many machine learning workloads. Common neural network structures such as Convolutional Neural Networks (CNN), and standard fully connected networks, are implemented based on matrix operations that can take advantage of the paral- lelization capabilities of a GPU. GPUs, with their relatively large and high-speed memory, offer the advantage of running and training models without constantly reading and writing from slower memory. As a result, GPUs often outperform other types of hardware such as Central Processing Units (CPUs), both in terms of latency and throughput, for most machine learning applications. But what is optimal also depends a lot on the application in question and the sur- rounding limitations. In some situations, power consumption might be the biggest concern, in others it might be the cost. Yet others put demands on latency which might completely change the appropriate hardware. For example, real-time appli- cations might be limited to running on edge devices since the server latency might be too high [3]. This in turn may come with power limitations while at the same time being limited to the hardware that is already available. Yet another factor that could change the appropriate hardware type is model op- timization techniques. Model optimization, in the context of machine learning, is the process of making changes to a model (or runtime) so that it runs faster and/or more efficiently on a given set of hardware, without significantly altering the model’s behavior. There are several different approaches to model optimiza- tion, with a promising sub-field being model compression. The use of compression 1 1. 
Introduction techniques – like pruning and quantization – can yield considerable speed-ups for inference on some hardware types while also reducing the storage footprint. 1.1 Background For decades, researchers have worked to emulate the brain’s complex functions using artificial systems. These efforts inspired to the creation of Artificial Neural Networks, which are network structures with nodes representing neurons and weights repre- senting connections. These structures are a cornerstone of modern machine learning and have proven to be highly capable. However, over time, a trend of larger and larger network sizes has emerged resulting in higher computational demands. In response, the biomimicry-based concept known as pruning, which draws inspi- ration from the brain’s ability to refine its neural connections over time, has been explored as a way of making models more efficient [4]. In the context of neural networks, pruning is the process of removing specific weights and/or neurons based on their importance, in order to simplify the model. Previous studies have proposed different pruning techniques with very promising results, as shown in [5]. It demonstrates that pruning unnecessary weights can give great results, with large sparse models outperforming small dense ones of the same size, on multiple fronts. Other previous works have explored alternative approaches for determining what to prune in the network, such as removing individual nodes or groups of nodes [6], [7], [8]. These still face a few challenges, however. Aggres- sive pruning may incur significant information loss which in turn can impact the predictive performance. The theoretical speed-ups possible through pruning have typically only been realized using coarser pruning strategies or implementations leveraging very specific network structures, but this is not true for the case of SparseML. SparseML – developed by Neural Magic – is an optimization library that compresses models for efficient inference on CPU using their runtime engine: DeepSparse [9]. It achieves substantial speed-ups on a wide range of model types with the use of fine-grained pruning strategies that can be efficiently utilized by the underlying structure of a CPU, something that can not be said about GPUs. CPUs – like the ones powering computers and smartphones – are more versatile in nature and are thus able to handle a wider variety of computations. They usu- ally have a smaller number of more powerful cores compared to the relatively high number of "simpler" cores found in GPUs. The reason for this is that CPUs are op- timized for sequential workloads, unlike the parallelization approach of GPUs. This – in combination with the fact that CPUs’ dedicated memory is relatively small – results in a lot of inefficient memory transfers, often with the same data being shuffled in and out of memory multiple times during inference. This makes CPUs inferior for most machine learning applications, with their only advantage being their versatility and widespread use. However, Neural Magic claims they are able to 2 1. Introduction provide GPU-level inference speed on CPUs thanks to their optimization techniques [10]. Another notable example of machine learning hardware is Tensor Processing Units (TPUs), developed by Google. TPUs are Application-Specific Integrated Circuits (ASICs) specifically designed for use with TensorFlow [11]. 
These excel at doing large matrix operations at scale and can outperform CPUs and GPUs in regards to throughput on some specific workloads [12]. However, this speed comes at a cost. Factors such as model size and data constraints might have to be considered since TPUs only realize their speed-ups for bigger models and at larger batch sizes, making them impractical for some applications. The model architecture itself can also have a major impact on inference performance with some architectures not being supported at all. This limits the situations where TPUs can effectively be utilized and adds to the complexity of choosing hardware. The GPU has also seen advancements in recent years. Particularly with NVIDIA’s introduction of Tensor Cores [13] together with their inference library: TensorRT [14]. Tensor Cores are specially designed processing units that are well suited for computing parts of neural networks that depend heavily on matrix operations. Ten- sorRT provides the tools needed to convert a model into a format compatible with Tensor Cores, while also providing a compatible runtime for execution. 1.2 Gap Both Neural Magic and NVIDIA make claims about outperforming the other with their proprietary technologies [15] [16]. But despite their claims, there exists no comprehensive comparison that accounts for both optimization techniques as well as recent technological advancements. A reason for this could be that each party tries to showcase its strengths while not shining any light on its weaknesses. NVIDIA for example, when comparing their machine learning accelerators to CPUs, often uses examples that are well suited for parallelization in combination with larger batch sizes when computationally applicable [17]. This benefits NVIDIA since GPUs generally scale well with larger batch sizes, resulting in relatively little additional overhead as long as hardware limits are not exceeded. This showcased inference performance is not in any way incorrect, but it can still be misleading, particularly for the use-case studied in this thesis which is real-time applications. For real-time latency-critical computer vision applications, batching is unlikely to decrease the latency, at least as long as computational power is not the main bottleneck. The last frame in a given batch would, in the ideal case, take at least the same time as computing all the frames sequentially. The first frame meanwhile would have to wait for all subsequent frames in the batch to be created, before processing can start, adding substantially to that frame’s absolute latency. This means that the results presented are not a fair indication of the real-world performance for real-time applications like the one explored in this thesis. 3 1. Introduction Neural Magic on the other hand shows results relevant to the task but it is still lacking in some areas [18]. Firstly, they fail to take recent technological advance- ments by NVIDIA into account in their comparison. Their optimization techniques may also have an impact on the model’s predictive performance, which is not well represented in their results. In conclusion, there seem to be no publicly available apples-to-apples comparisons of current state-of-the-art GPU technology versus traditional CPU hardware em- powered by modern optimization techniques. 1.3 Aim This thesis aims to determine if and when CPUs can be a competitive alterna- tive to modern GPUs, for a given real-time computer vision task, with the aid of optimization techniques. 
The aim is also to objectively showcase the trade-offs be- tween inference speed and predictive performance inherent with these optimization techniques. 1.4 Limitations This work focuses on the optimization of a convolution-based pose prediction model for use with real-time applications, without considering other types of models, tasks, or network architectures. Additionally, the scope of this work is limited to the model itself, without considering the performance impacts of other parts of the pipeline such as pre-processing and post-processing. The chosen model is also the base for all evaluated model versions with the un- optimized performance serving as the baseline that the results are compared against. No comparison with external benchmarks is performed. The hardware used for evaluation is also limited due to availability. On the GPU side, a Nvidia Jetson AGX Orin is the only hardware used, while on the CPU side, Google Cloud instances limited to a single line of processors are evaluated. Cloud instances with varying numbers of cores are however considered and evaluated. The optimization techniques explored are limited to pruning and quantization, with pruning itself being limited to unstructured pruning. Structured pruning is discussed but not evaluated since improvements would apply to both hardware. Neither is semi-structured pruning explored due to the minimal impact on inference speed when small batch sizes are used, as is common for real-time applications [19]. 1.5 Research Questions This thesis aims to answer the following questions: 4 1. Introduction • Can CPUs be a competitive alternative to GPUs for latency-critical applica- tions when using compression techniques? • What is the predictive performance impact resulting from the use of compres- sion techniques? 5 1. Introduction 6 2 Theory This section covers the theoretical background relevant for the work being done in this thesis. The areas covered mainly revolve around model compression as well as topic relating to model inference. 2.1 Convolutional Neural Networks Convolutional Neural Networks (CNN) are a special type of feed-forward neural network often used for vision tasks like image classification and object detection [20]. These networks have a fewer number of connections (also known as weights) than many other types of neural networks, like fully connected networks, with the same number of neurons. This is due to how the layers within CNNs, called convolutional layers, work. These layers leverage kernels, or filters, which are groups of learnable parameters that slide along the spatial dimensions of the input. In the context of vision-based CNNs, this often involves the use of 2D convolutions where the input consists of two spatial dimensions together with a third optional dimension that represents the color channels, as visualized in Figure 2.1. As the kernel slides along the spatial dimension, it convolves with that part of the input to produce an output corresponding to that region. This essentially means that these weights are being shared among multiple output neurons, leading to the reduction in parameter count. In modern CNN architectures for tasks such as image classification, object detection, and pose estimation, the network is often divided into two main components: the backbone and the head. The backbone serves as the foundational component of the network, responsible for feature extraction from the input data. 
It consists of a series of hierarchically arranged convolutional layers, designed to capture features from the raw input image or data. The head component is responsible for task- specific processing, taking the features extracted by the backbone and transforming them into predictions relevant to the task at hand. For instance, in object detection, the head may include layers for bounding box regression and classification, while in pose estimation, it may involve layers for keypoint detection and association. 7 2. Theory Figure 2.1: Visualization of a 2D convolution, including the input matrix (left), the convolutional kernel (center), and the resulting output matrix (right). 2.2 Pruning Pruning is an optimization method for neural networks first introduced in [4], but has since then evolved considerably. The general idea behind pruning is to remove nodes and/or weights that are deemed to have low importance for a model’s output. There are a few different approaches that work in slightly different ways and with slightly different goals in mind, but the process can generally be split into three overarching steps: weight/node selection, the actual pruning, and fine-tuning. When it comes to the selection of weights/nodes, there are also a few different meth- ods that work in slightly different ways. Optimal Brain Damage (OBD), introduced by the paper of the same name [4], is a selection method based on the loss function of the model. This method aims to find the weights that have the lowest impact on the loss once removed. Computing the loss impact of every node is not always feasible, however. To solve this, OBD uses a local approximation of the loss function to determine the weights that should be selected. Magnitude-based methods on the other hand are a lot simpler in nature. These methods build on the assumption that weights that shrink during training and end up small, likely have low importance for the final output. Although this is not always the case, it is a good enough approximation that works fairly well, especially compared to purely random methods relying on pure chance alone [21]. Although magnitude-based methods are relatively simple, they have the added benefit of being easier to compute in most situations, while not lagging too far behind other methods [22]. Once the nodes/weights to prune have been established, the pruning can take place 8 2. Theory Figure 2.2: Illustration of structured and unstructured pruning. Left side shows the removal of structured groups of weights, while the right side shows the removal of weights in a unstructured manner. and this, in turn, can be done in a few different ways. The different pruning ap- proaches differ fundamentally from each other, with their advantages, disadvantages, and use cases. The different types of pruning approaches are described in more detail in the following sections. 2.2.1 Unstructured pruning Unstructured pruning is an inherently fine-grained approach with its focus on in- dividual weights, as visualized in Figure 2.2. It revolves around setting individual weights to zero (0), thus effectively removing them in place without making changes to the underlying structure of the network. The resulting model is called a "sparse model" since it is no longer densely connected (although it may still technically be due to weights only being masked). Sparse models have some big implications when it comes to model inference. 
Any contribution to the output from nodes involving a masked weight is always known to be zero and can thus be ignored during computation. This could in theory provide a substantial boost in inference performance, especially at the sparsification levels shown to be practical in previous work [4] [23]. In practice, however, it is not that simple. Hardware type plays a major role in the effectiveness of fine-grained pruning strate- gies. GPUs, as mentioned in Section 1, rely heavily on parallelization due to how they are fundamentally designed to operate [24]. This design provides great per- formance for inference and is one of the reasons they are so dominant for machine learning tasks. But one thing they tend to struggle with is the utilization of sparse computations. The computation time for a parallel computation is nearly constant, as the number of operations done in parallel scale. This means that on hardware that leans heavily on parallelization, like GPUs, pruning provides little to no benefit in terms of speed-ups. Pruning also has other benefits. It allows for a smaller storage footprint than their 9 2. Theory dense counterpart [5]. But this size reduction depends heavily on how the sparse model is stored as well as its sparsification level. Since the sparsification itself also adds some storage overhead, the reduction may vary. 2.2.2 Structured pruning Structured pruning is a coarser approach where collections of weights that make up structural parts of the neural network are removed [25], as shown in Figure 2.2. There are a few different types of structured pruning that trim different parts of the structure and are suited for different types of model architectures. A simple approach specifically relevant for CNNs is filter pruning [26]. Filter pruning, as the name suggests, involves the removal of entire filters, a pro- cess that does not introduce any sparsity and leaves the model dense. This means the model can be computed in the same way as before, which has the benefit of not requiring any specialized hardware or software implementations to be utilized. Any theoretical speed-ups will be realized on any previously compatible hardware, allowing for painless deployment. There are, however, some major drawbacks. Structured pruning is inherently a lot coarser in nature than unstructured pruning [24]. The least important collec- tion of weights may still contain important information in some weights that will be discarded along with the rest of them. This results in quicker drops in predictive per- formance as compression levels increase, compared to unstructured pruning, which limits the amount of pruning that can practically be applied through structured pruning. 2.2.3 Pruning-related fine-tuning Pruning is most often accompanied by training or fine-tuning. When pruning a model that will be trained from scratch, the pruning is typically incorporated into the main training loop so that it is performed iteratively throughout the training process [4]. Similarly, when pruning a pre-trained model, it is often done in an iterative process, alternating between pruning and fine-tuning. The reason for this is that the removal of weights can, and likely will make the model diverge from the previous solution. Thus there is the need for re-calibrations to compensate for this change, a process that requires a dataset. The hope is that the model retains the intended information while discarding redun- dant information in the process. 
This is the reason why it is often done iteratively and not all at once. When pruning iteratively, the information can be distilled be- tween pruning steps [27]. This in turn changes the state of the model which can change the weight magnitude distribution completely. For example, weights that were initially small but not small enough to be pruned can come to contain distilled information after the fine-tuning, either directly from the information contained in the removed weights or simply due to the changing needs of the network around it. 10 2. Theory 2.2.4 Post-training pruning Although pruning often is done iteratively, in tandem with training, it can still be done in a single step. This is referred to as one-shot or post-training pruning, and this approach has the added benefit of being a lot simpler to implement and compute. Where it falls short however is predictive performance [28]. One-shot pruning tends to incur higher losses in predictive performance compared to gradual pruning techniques at the same compression rate. 2.3 Quantization Neural networks can be quantized to reduce the numeric precision of the network’s weights and activations. This process involves representing the weights and ac- tivations with lower bit-widths (e.g., 8-bit integers instead of 32-bit floating-point numbers), with the primary goal of reducing the computational and memory require- ments of neural networks. It can significantly enhance the inference performance of neural networks, leading to faster inference times, lower power consumption, and reduced memory usage. This has been shown to greatly increase computational efficiency on a wide variety of models to a high degree, with minimal impact on precision [29]. There are two main ways that quantization can be applied, namely, post-training quantization (PTQ) and quantization-aware training (QAT). The first step of quantization is typically to determine the fixed parameters such that it minimizes the information loss during conversion. These parameters vary based on the conversion technique used, with the main two being symmetric and asymmetric quantization. Symmetric quantization uses the absolute maximum value to map the weights/activations, while asymmetric quantization additionally uses the zero-point value. Absolute maximum quantization minimizes the information loss by mapping the absolute maximum value of all weights to the min/max values of the quantized range. This is done by dividing the maximum original value, |rmax|, and scaled by a factor S. This process can be represented as q = S r |rmax| (2.1) Here, q represents the quantized value, r the original value, and S the scaling factor. Zero-point quantization is typically used when dealing with an asymmetric distri- bution. This could for example occur when dealing with only positive values, such as those resulting from the ReLU (Rectified Linear Unit) function. ReLU is an ac- tivation function replacing all negative input values with zero and is very common for CNNs. The conversion in zero-point quantization includes two types of parameters: one for scaling the values and another representing the zero-point value. A representation 11 2. Theory Figure 2.3: Visualization of zero-point quantization, illustrating the mapping of a floating-point distribution onto a lower-precision representation. Z represents the zero-point value. of how the values are scaled and mapped in relation to the zero-point value is visual- ized in Figure 2.3. 
This example demonstrates a skewed floating-point distribution mapped onto a representation with lower resolution, which includes a zero-point value Z. The relationship between the original values r and quantized values q is calculated as r = S(q − Z) (2.2) which includes both the scaling factor S and the zero-point parameter Z. 2.3.1 Post Training Quantization (PTQ) Applying quantization after training is a simpler approach compared to QAT because it does not require adjusting the training process to account for quantization [30]. Instead, quantization is applied to the model’s weights after training. However, there are potential drawbacks to this approach. Since quantization is not considered during training, applying it to an already trained model may lead to severe precision loss. To address this quantization error, the quantized model can be fine-tuned to compensate for this loss in accuracy. This will however not yield the same result as through the use of QAT. 2.3.2 Quantization Aware Training (QAT) A challenge encountered during quantization is the drop in model accuracy resulting from the reduced numeric precision of weights. This is addressed by emulating the impact of the lower precision to make the model account for these changes during training. This is done by introducing augmenting operations that emulate the effects of quantization during the forward pass, without modifying the numeric precision of the parameters or the rest of the training process [30]. An example of quantizing a convolutional layer is visualized in Figure 2.4. Typically, the weights are quantized before being multiplied or convolved with the input, while the outputs are generally quantized after the activation function. This approach is beneficial since the activation function is fused with the main operation in the most optimized hardware setups. The same approach is applied to other layers, such as concatenation and addition, allowing the model to adjust the parameters for lower precision. 12 2. Theory Figure 2.4: Illustrated example of a convolutional layer with QAT integration. Figure 2.5: Flowchart of the model optimization process when using SparseML and DeepSparse. When propagating backward, the gradients of the loss with respect to the model parameters are computed while skipping the quantization step. This allows the model to optimize its parameters while considering the effects of quantization. 2.4 SparseML SparseML is an optimization library developed by Neural Magic that attempts to leverage unstructured pruning together with other compression techniques for high inference performance on CPUs [10]. Models that are compressed using SparseML are specifically optimized for use with using their runtime engine DeepSparse [18], designed to take advantage of the underlying structure of CPU hardware. A diagram for the sparsification process is shown in figure 2.5. 13 2. Theory There are two different approaches to how SparseML can be used. One involves fine- tuning an already pruned model, while the other involves sparsifying a model from scratch. The already sparsified models, along with their corresponding sparsification recipes, are available at Neural Magic’s repository called SparseZoo [31]. These models can be retrained on a new task or utilized as they are. SparseML integrates with various frameworks, such as PyTorch and TensorFlow, by utilizing the callbacks integrated into their training processes to apply the com- pression. 
Since most frameworks already have these callbacks, no modifications to the framework are needed and the training process can proceed as normal with compression. The compression applied during training is configurable and can be defined in a file known as a sparsification recipe. Besides specifying training-related parameters such as learning rate and number of epochs, it also specifies if and how pruning and optimization should be applied. The types of compression that SparseML supports include both gradual and one-shot pruning techniques. Additionally, it supports both QAT and PTQ, with QAT being the dominant method used in Neural Magic’s model repository: SparseZoo [31]. Typically, when applying both pruning and quantization, the process involves ini- tially stabilizing the model for a few epochs, followed by a gradual application of pruning, and then lastly applying quantization, accompanied by calibration for the case of QAT. 2.5 DeepSparse DeepSparse is a runtime engine developed by Neural Magic that is optimized for running sparse models on x86 CPUs. It achieves speed-ups through something called Column Tensors [18]. Column Tensors are isolated groups of connected neurons within the neural network that span it in a depth-wise fashion as visualized in Figure 2.6. Unstructured pruning introduces "cavities" into the network structure, allowing it to be broken into column tensors, something that can not be done with an unpruned model due to its densely connected nature. For dense networks, no simpler representation exists, so no simplifications can be made. Column tensors are well suited for the x86 architecture since they are often small enough to fit entirely into the CPU cache, a limiting factor of CPUs as previously mentioned in Section 1.1. The isolated nature of these Column Tensors also enables them to be computed completely independently on a single processor core. There- fore, these computations can be effectively distributed among the small number of available cores typical for CPUs, thus speeding up computation. 14 2. Theory Figure 2.6: Visualization of how DeepSparse utilizes sparsity on CPU hardware. On the left, traditional execution on CPU is shown, while on the right, the sparse model is decomposed into column tensors that can run independently on individual cores. 2.6 TensorRT TensorRT is a software development kit developed by NVIDIA for optimizing deep learning models for deployment on NVIDIA GPUs. The optimization done by Ten- sorRT takes advantage of a special type of processing unit called Tensor Cores and was first included with NVIDIA’s Volta architecture in 2017 [32]. Tensor Cores are specially designed hardware that performs fused multiply-add com- putations in a single step. These fused multiply-add operations can be utilized to quickly solve matrix operations, like the ones used for efficient computation of some neural network structures like convolutional layers [33]. The fused multiply-add operation takes three 4x4 matrices, denoted as A, B and C, and the output D calculated as A4×4 · B4×4 + C4×4 = D4×4 (2.3) Although the unit only performs these operations on matrices of size 4x4, it is still able to accelerate some workloads considerably. NVIDIA claims up to six times the number of floating-point operations per second (FLOPS) for inference compared to the previous generations of hardware [32] and research has also shown inference to be around 65% [34] faster on the same hardware when Tensor Cores are utilized compared to without. 
Convolution-based neural networks are also well suited to take advantage of this type of hardware since the computation of their convolutional layers effectively can be represented by matrix multiplications [33]. The optimization tools provided by TensorRT help facilitate the conversion process that is needed to make the model suitable for execution with the provided TensorRT runtime. This is done by fusing parts of the neural network structure (such as convolutional layers) into their Tensor Core compatible counterparts. 15 2. Theory Figure 2.7: Visualization of IoU, being the ratio between the intersection area and the union area between two bounding box instances. 2.7 ONNX Runtime Open Neural Network Exchange (ONNX) is an open-source machine learning format that is supported by a wide range of frameworks [35]. The idea behind ONNX is to allow for greater interoperability between different ecosystems encouraging collaboration. ONNX Runtime is a complement to ONNX, functioning as an inference engine used to deploy deep learning models represented in this format. This runtime supports several different hardware platforms including CPUs, GPUs, and other specialized accelerators. The functionality revolves around optimizing the neural network graph by partitioning it into subgraphs such that it fits the specialized hardware. 2.8 Pose Metrics Performance metrics are used to measure the predictive performance of neural net- works using a number of indicators. In pose estimation, which involves the identi- fication of keypoints often corresponding to specific body parts, determining what constitutes a correct prediction can be challenging. For instance, an eye is a distinct feature and is easier to predict with high certainty compared to the ambiguity asso- ciated with the actual location of a shoulder. The correctness of the eye’s prediction is therefore more important than that of the shoulder. A commonly used metric that accounts for this difference is Object Keypoint Similarity (OKS). It considers the distance between the ground truth keypoint and the predicted keypoint, weighted by both the keypoint type as well as accounting for the scale of the object being detected [36]. The process of calculating OKS involves taking the Euclidean distances di between each ground truth and predicted keypoint, then determining the similarity score KSi for each of them. The KSi is calculated by taking the probability density of a Gaussian distribution with standard deviation ski evaluated at di. The standard deviation ski consists of two variables: s representing the detected object’s relative size in the input and ki defining the per-keypoint constant which describes the point’s 16 2. Theory Figure 2.8: Visualization of the similarity score distribution of two different point types. Similarity score above 0.5 is represented by the red inner ring. Original image by Franki Chamaki on Unsplash [37]. ambiguity. Figure 2.8 visualizes the tolerance threshold of different keypoint types. The equation for calculating KSi is KSi = exp ( −d2 i 2s2k2 i ) (2.4) This approach provides a representation of similarity, with higher probabilities in- dicating closer agreement between the predicted and ground truth keypoints. Typi- cally, ki is tuned by measuring the per-keypoint standard deviation with respect to the object scale across a certain number of images from the dataset. This accounts for the scale of the keypoints in the performance measurement. 
The total OKS is calculated by taking the arithmetic average of these KSi values using Equation 2.5. Here, vi represents the ground truth visibility flag, resulting in vi being 1 if the keypoint is labeled and 0 otherwise. If a keypoint is not labeled (vi = 0), it does not affect OKS, resulting in the average being taken only of (vi = 1). The equation for OKS is defined as OKS = ∑ i KSivi∑ i vi (2.5) In addition to keypoint predictions, some pose models also offer bounding box pre- dictions. The evaluation of these bounding box predictions often involves the in- tersection over union (IoU) metric [36]. IoU involves defining the ratio between the 17 2. Theory intersection area and the union area between two bounding box instances, as shown in Figure 2.7. This metric ranges from 0 to 1, where 1 signifies perfect overlap and 0 indicates no overlap. The IoU and OKS metrics are used to measure the "correctness" of the box and pose predictions respectively. A value above a certain threshold, typically 0.5, signifies a correct prediction, also referred to as a true positive prediction. In this con- text, precision measures the ratio of true positive predictions to the total number of predictions (true positives+false positives), indicating the proportion of correct predictions. Mathematically, it is defined as Precision = True positive True positive + False positive (2.6) Another commonly used metric is recall, which signifies the model’s capability to detect all relevant labels. It calculates the ratio of correctly predicted instances (true positives) to the total number of ground truth instances (true positives+false negatives). This metric is given as Recall = True positive True positive + False negative (2.7) Achieving both a high recall and precision is desirable as it indicates that the model’s predictions are correct and that most labels are detected. However, there is an inherent tradeoff between precision and recall depending on the chosen confidence threshold. To evaluate the balance between them, a precision-recall graph can be plotted across various confidence thresholds at a specific value of IoU/OKS. The area under this curve, referred to as the Average Precision (AP) at that particular value of IoU/OKS, is defined as AP = ∫ 1 0 p(r)dr (2.8) For models with multiple classes, the mean Average Precision (mAP) is used to provide a comprehensive measure of AP. mAP is calculated by averaging the AP across all classes (N), defined as mAP = 1 N N∑ i=1 APi (2.9) This results in a single metric that reflects the model’s overall precision. 18 2. Theory Figure 2.9: Example of the output predicted by the YOLOv8s pose model, in- cluding detected keypoints, bounding boxes, and the associated confidence scores. Original image by John Doe on Unsplash [40]. 2.9 YOLOv8 Pose The YOLOv8 pose model [38] developed by Ultralytics is a variant of the YOLO (You Only Look Once) algorithm [39], which is an object detection algorithm that predicts bounding boxes from an image. YOLOv8 pose is a pose estimation model that comes in a few different sizes, with the small variant having 11.2 million pa- rameters. These models are trained using the COCO pose dataset [36], which is comprised of 200,000 labeled images of human poses. The model’s objective is to predict the poses of all humans present within an image. This is done by predicting the location of a set of keypoints that correspond to specific body parts, as visual- ized in Figure 2.9. 
Each detection includes a bounding box along with 17 keypoints, accompanied by its corresponding confidence score. The pose model is structurally similar to the YOLOv8 detect models, with the addition of a few layers at the end of the network that handle the prediction of keypoints, as well as a slight difference in parameter count in some layers. The loss function for the YOLOv8 pose model involves a combination of compo- nents designed to optimize both bounding box detection and keypoint estimation. There are two types of loss for both boxes and poses: localization and confidence. Localization loss measures how accurately the model predicts the coordinates while confidence loss penalizes incorrect predictions. This confidence loss is computed using binary cross-entropy, where the model is trained to predict whether an ob- ject is present in each grid cell and how confident it is in that prediction. Binary 19 2. Theory cross-entropy involves quantifying the difference between predicted and labeled prob- abilities for each class to penalize the model more heavily for incorrect predictions [38]. The metrics measured during model validation include precision, recall, and mean Average Precision (mAP). These metrics are computed individually for both bound- ing box and keypoint predictions. In Ultralytics’ validation function, it typically includes two different mAP metrics: mAP50 and mAP50-95. mAP50 is a measure of the mean average precision of predictions over a threshold of 0.5, while mAP50-95 measures the mean average precision of the model’s predictions across a range of IoU and OKS thresholds, from 0.5 to 0.95. This broader assessment evaluates how well the model detects objects or keypoints across different levels of overlap with ground truth annotations. 20 3 Methods To evaluate the capability of CPU being a competitive alternative to GPU for real- time applications, a few different approaches and runtimes were tested. These in- clude DeepSparse, designed for accelerating computations on CPU, TensorRT, a leading-edge technique for fast GPU inference, and ONNX Runtime representing the baseline GPU performance. The base model used for these experiments was the YOLOv8s-pose model developed by Ultralytics. This is a pose estimation model predicting human keypoints, as described in section 2.9. This model was chosen for its efficiency and balance of speed and accuracy, making it well-suited for real-time computer vision tasks in resource-constrained environments. 3.1 CPU evaluation SparseML [10] and DeepSparse [18] were used for speeding up the inference on CPU due to their state-of-the-art capabilities within this field. This process involved retraining the YOLOv8 pose model with optimization techniques, followed by vali- dation on specific hardware to evaluate inference performance. 3.1.1 Hardware For the CPU-based optimization to reach its full potential, the hardware had to be taken into consideration. Quantization performed using SparseML is intended for use with DeepSparse which relies on specific x86 instruction set extensions to work. The ones supported by DeepSparse are AVX2, AVX512, and AVX512 VNNI, with particular optimization for AVX512 VNNI specifically [41]. Although AVX2 and AVX512 are supported through emulation, their performance may not be as efficient. For the experiments, we selected Google Cloud instances equipped with AMD Genoa hardware, which supports AVX512 VNNI. 
This had the added benefit of enabling an evaluation of different configurations by just changing the number of cores while keeping other factors fixed. The CPU configurations tested included setups with 8, 21 3. Methods 15, and 30 cores, as these were the available options for this instance. 3.1.2 Implementation For pruning and quantization to be applied using SparseML, the model in question has to be supported. SparseML has built-in support for YOLOv8 models, but this implementation was based on an Ultralytics version from before the introduction of pose detection, thus it lacks support for the pose task. This lack of support also meant that there were no pre-sparsed pose models available on Neural Magic’s model repository: SparseZoo. Therefore, the model had to be trained/pruned using SparseML from scratch. Furthermore, newer versions of SparseML (version 1.6.4 and later) also had a pre- viously unreported bug that broke quantization for exported models. This bug was confirmed by the support from Neural Magic, and therefore, a modified version based on an older version had to be created specifically for exporting the trained models. We settled on implementing two different versions of SparseML due to the limitations mentioned above, one for pruning/training of the model, and the other for exporting it. This involves the following implementations in more detail: • Pruning/training modifications To sparsify and/or quantify the YOLOv8 pose model, a newer version of SparseML (1.7.0) was utilized. This version depended on a version of Ul- tralytics (8.0.124) that contained the code related to the pose task. Therefore, an augmented version of SparseML was created by manually overwriting the validation and training classes. These custom class implementations re-used code for the pose functionality already present in Ultralytics. • Export modifications When it came to exporting the models, an older version of SparseML (1.5.4) had to be used, with the reason being a bug that broke quantization in the newer versions. This version of SparseML had a dependency on a version of Ultralytics (8.0.30) from before the introduction of the YOLOv8 pose model, which had a completely different file structure, complicating the implementa- tion process and making it different from the other implementations. Starting from these versions of SparseML and Ultralytics, a modified version of each was created exclusively to facilitate the export process of the models. The modifications entailed the re-implementation of some pose-related functional- ity as well as extending the existing code to handle pose models. Using a newer version of SparseML only for training, with its dependency on an Ultralytics version supporting pose, was deemed to be of enough benefit to be justi- fied. It made the implementation process a lot simpler by enabling most functions and surrounding code to be reused in the new implementation. In the case of the 22 3. Methods version intended for exporting the model, the work was also deemed to be simpler compared to doing a combined implementation with both training and exporting. Only code relating to the export process had to be modified, saving a lot of time and effort. The sparsed and quantized models were then validated using DeepSparse. Since DeepSparse, similar to SparseML, has built-in support for YOLOv8 but does not inherently support the pose task, this had to be implemented. 
This involved the following adjustments in more detail: • Validation modifications The pose functionality had to be implemented into DeepSparse (version 1.7.1) to enable validation of the sparsed and/or quantized pose model. This version of DeepSparse depends on an Ultralytics version (8.0.124), which includes the pose estimation task. Therefore, an augmented version of DeepSparse was created, similar to the one made with SparseML. Specifically, this involved overwriting the validation class with a modified version, leveraging the existing pose-specific functionality in Ultralytics. 3.1.3 Sparsification recipes selection Since no pre-sparsed pose models existed, the model had to be optimized from scratch through the use of sparsification recipes in SparseML. The choice of recipes was in turn influenced by several factors, with one being the existence of recipes for related models. SparseZoo hosts several pre-sparsed models along with the sparsifi- cation recipe used, and this includes versions of the YOLOv8 detect model. These models are nearly identical in structure to the YOLOv8 pose models, only differing by a few layers in the head, with some parts also differing in parameter count. These recipes could therefore be used for the pose model without any modifications. The parameters defined in these recipes have been optimized for the YOLOv8 detect model, containing a non-uniform distribution of pruning among the layers. This made it challenging to create comparable recipes for the pose model from scratch. Since the methodology for choosing these parameters is not available, our approach would have had to be experiment-based. Realistically, only a limited number of custom recipes could have been explored within the time frame, potentially limiting the model’s results. Although a custom recipe using a uniform pruning distribution was explored briefly, the results seemed inferior and were not explored further. Due to the challenges associated with creating custom recipes, it was decided that the detect recipes were to be used as a base for all experiments. A consequence of this was that the recipes had to be limited to the sparsity levels already present in SparseZoo. This involved the sparsity levels 50%, 55%, 65%, and 70%, achieved using a magnitude-based method, which were included in the evaluation. Some modified versions of recipes were also included in the final evaluation. These 23 3. Methods Table 3.1: Overview of the models trained with SparseML. The models are denoted with "sml" to indicate their use of SparseML, "p..." to define their sparsity level, and "int8" to specify quantization. The recipes for these models are sourced from SparseZoo, with an "m" denoting any modifications. Models Sparsity level (%) Numeric precision Note ¢¢¢ sml-base 0 fp32 sml-int8 0 int8 Quantized base model. sml-p50 50 fp32 sml-p50_int8 50 int8 sml-p55 55 fp32 sml-p55_int8 55 int8 sml-p55_int8-m 55 int8 Pose head sparsified. sml-p65 65 fp32 sml-p65_int8 65 int8 sml-p70 70 fp32 sml-p70_int8 70 int8 sml-p70_int8-m 70 int8 Pose head sparsified. included pruning in the pose-specific layers and tried to mimic the pruning distribu- tion present in the original recipes. Pruning and quantization were also evaluated independently to attempt to demonstrate their individual impact on the model. The quantization method used for all quantized recipes was QAT. The final recipes evaluated are presented in Table 3.1. 
All models were trained on the COCO pose dataset using consistent hyperparameters, including Ultralytics' default batch size of 16 and a fixed image size of 512x512. However, certain parameters, such as the number of epochs and the learning rate, differ depending on the compression techniques being applied.

3.2 GPU evaluation

The GPU experiments were mainly performed using TensorRT. TensorRT was considered a fair addition to the comparison since it represents the current state-of-the-art GPU technology in terms of both hardware and software. This choice ensures fairness in the comparison, especially considering that optimization techniques were allowed and evaluated on the CPU side. Without TensorRT, some of the chosen hardware's capabilities would go unused and performance would be left on the table. Besides speeding up the base model, TensorRT also includes optimization techniques of its own, which are also taken into consideration.

Even though TensorRT was deemed a fair inclusion, an indication of baseline GPU performance was still needed for comparison. Therefore, ONNX Runtime was also evaluated and served as a GPU baseline when Tensor Cores were not utilized. ONNX Runtime was chosen since it generally provides competitive performance on a wide range of hardware.

3.2.1 Hardware

The hardware chosen for the GPU experiments was an NVIDIA Jetson AGX Orin (64 GB) running in the 50 W power mode. This was determined to be a suitable option given its Tensor Cores and its support for current versions of TensorRT. Availability was also a factor: a unit was allotted for the experiments, so no other hardware was needed.

3.2.2 Model conversion process

In order to use the TensorRT runtime for the evaluation, the base model had to be converted to a compatible format. This was done with trtexec [42], a command-line wrapper provided by NVIDIA. The model was exported with varying levels of quantization to be included in the comparison. Similarly, for ONNX Runtime, the base model was converted to ONNX format using Ultralytics. The final model conversions are specified in Table 3.2; a sketch of this conversion step follows the table.

Table 3.2: Overview of models converted to TensorRT and ONNX formats, indicated by "trt" and "onnx" respectively, followed by the numeric precision.

Model      Numeric precision  Note
trt-base   fp32
trt-fp16   fp16               Applied using PTQ.
trt-int8   int8               Applied using PTQ.
onnx       fp32               Base model in ONNX format.
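The sketch below shows how such a conversion can be scripted: export the base model to ONNX with Ultralytics (this file is used both by ONNX Runtime and as the input to trtexec), then build TensorRT engines at the different precisions. The model file name is a placeholder, and the exact trtexec invocation used in the thesis is not reproduced here; only the standard flags are shown.

```python
# Hypothetical conversion helper using the Ultralytics export API and the
# documented trtexec flags (--onnx, --saveEngine, --fp16, --int8).
import subprocess

from ultralytics import YOLO

# Export the base pose model to ONNX (placeholder model name).
onnx_path = YOLO("yolov8n-pose.pt").export(format="onnx")

# Build TensorRT engines at different numeric precisions.
for name, extra_flags in {
    "trt-base": [],
    "trt-fp16": ["--fp16"],
    "trt-int8": ["--int8"],
}.items():
    subprocess.run(
        ["trtexec", f"--onnx={onnx_path}", f"--saveEngine={name}.engine", *extra_flags],
        check=True,
    )
```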
3.3 Validation setup

This section outlines the methods used to evaluate the performance of the models. The evaluation focused on two main metrics: predictive performance and inference time. Given the differing hardware and the variation in optimization techniques applied, a setup that ensured a fair comparison was imperative.

For evaluating predictive performance, two modified versions of the pose validation function from Ultralytics were created. This validation function was chosen as a starting point because it was known to work as intended, minimizing the risk of the results being affected by the implementation. The different output formats of the runtimes necessitated these separate versions, one for TensorRT and one for SparseML. These two versions were used for the evaluation of all models except the one using ONNX Runtime; for ONNX, integration with Ultralytics was already in place, so no modifications were needed. The COCO pose dataset was used for all validation.

As a result of using the Ultralytics validation function, metrics such as precision, recall, and mean Average Precision (mAP) were obtained for both bounding boxes and poses. This included both mAP50 and mAP50-95, as described in Section 2.9. These metrics yielded a comprehensive assessment of the models, considering both the ability to predict correctly and the ability to detect all present instances.

The second validation setup measured the inference speed of the models in order to quantify the speed-ups. This evaluation also required two distinct setups to account for the runtime differences, but they functioned in fundamentally the same way. Time measurements were made using the tqdm Python package [43]. All experiments measured only the actual inference time, excluding latency related to pre-processing and post-processing. Warm-up iterations were incorporated into all runs; after the warm-up, the timings of the next 2000 inputs (batch size 1) were collected and averaged to obtain the final result. A minimal sketch of this measurement loop is shown at the end of this section.

A combined evaluation of these two metrics, predictive performance and inference time, was included to identify models that have been optimized for speed without significant loss of predictive performance. Predictive performance was measured using mAP50 due to its inclusion of both precision and recall, providing a balanced view of the model's performance.
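The following is a minimal sketch of the timing protocol described above for the CPU side, assuming DeepSparse's documented compile_model entry point. The model path, input shape, warm-up count, and core count are placeholders; the text only fixes batch size 1 and 2000 timed inferences, and the thesis collected the timings with tqdm rather than a manual loop.

```python
# Minimal sketch of the inference-time measurement protocol (placeholders only).
import time

import numpy as np
from deepsparse import compile_model

# Compile the sparsified/quantized ONNX model for a given core count.
engine = compile_model("sml-p55-int8.onnx", batch_size=1, num_cores=30)
dummy = [np.random.rand(1, 3, 512, 512).astype(np.float32)]

for _ in range(100):                 # warm-up iterations (count not specified in the text)
    engine.run(dummy)

timings = []
for _ in range(2000):                # timed single-image inferences
    start = time.perf_counter()
    engine.run(dummy)
    timings.append(time.perf_counter() - start)

print(f"Mean inference time: {1000 * sum(timings) / len(timings):.3f} ms")
```

The GPU-side measurement follows the same structure, only with the TensorRT engine in place of DeepSparse.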
4 Results

The results are presented from two distinct angles to highlight the main aspects considered by the experiments: predictive performance and inference time. The first section showcases the precision and recall impact of the different optimization techniques, evaluated using a variety of configurations, where each configuration consists of a unique combination of optimization techniques and compression rates. The second section showcases the computational speed-up of these configurations on their respective hardware. The third section contains a combined evaluation of both predictive performance and inference time.

4.1 Predictive performance

This section presents the predictive performance results of the different configurations without considering the hardware. The reason for this is that the model is deterministic once the weights are fixed; its behavior is therefore identical across all hardware, as long as the model can run on it.

The impact on predictive performance is visualized through graphs incorporating the results from validating the models listed in Tables 3.1 and 3.2. These include the models trained with SparseML as well as those exported at various numeric precisions using TensorRT.

Figure 4.1 shows the precision of both the bounding box and keypoint predictions for each model. From this, it is clear that the quantized int8 models without pruning (sml-int8 and trt-int8) yield the biggest drop in precision compared to the base model, while the half-precision model (trt-fp16) shows close to no difference. Among the models that combine pruning and quantization, sml-p55, sml-p65, and sml-p55-int8 stand out: their box precision increases, while there is only a small drop in pose precision compared to the base case.

A similar trend can be seen across the rest of the graphs in this section: severe performance drops for the models purely quantized to int8 precision, on both CPU and GPU, while the model quantized to fp16 performs nearly identically to the base model across the board (Figures 4.2-4.4).

Figure 4.1: Comparison of precision for both pose and box predictions between the optimized models and the base case.

Figure 4.2: Comparison of recall for both pose and box predictions between the optimized models and the base case.

Figure 4.3: Comparison of mAP at an IoU/OKS threshold of 50% for pose and box predictions between the optimized models and the base case.

The slight precision improvements seen for the sml-p55 and sml-p55-int8 models (Figure 4.1) do not carry over to the other plots. Their performance fluctuates around the baseline, with no single model outperforming the base model in all metrics. In the recall graph (Figure 4.2), sml-p70 saw a marginal improvement, while all other models performed slightly below baseline for both pose and box recall.

Figures 4.3 and 4.4 visualize the mAP results at a threshold of 0.5 and averaged over thresholds from 0.5 to 0.95, respectively. It is noteworthy that the pruned models that include quantization are the most affected by the uncertain predictions, as seen in Figure 4.4. An overview of the pose metric results is presented in Table 4.1. Overall, trt-fp16 is the model least affected in comparison to the base model.

Figure 4.4: Comparison of mAP across IoU/OKS thresholds ranging from 50% to 95% for both pose and box predictions between the optimized models and the base case.

Table 4.1: Overview of the models' predictive performance, including precision (P), recall (R), and mAP for both box and pose predictions.

                 Box                               Pose
Model            P      R      mAP50  mAP50-95    P      R       mAP50  mAP50-95
sml-base         0.873  0.838  0.920  0.721       0.859  0.776*  0.837  0.575
sml-int8         0.823  0.714  0.824  0.562       0.743  0.562   0.586  0.256
sml-p50          0.878  0.810  0.910  0.701       0.850  0.739   0.796  0.502
sml-p50-int8     0.876  0.827  0.914  0.705       0.839  0.759   0.806  0.510
sml-p55          0.882  0.829  0.917  0.715       0.857  0.765   0.823  0.548
sml-p55-int8     0.885  0.819  0.917  0.710       0.853  0.761   0.817  0.522
sml-p55-int8-m   0.870  0.821  0.909  0.699       0.837  0.742   0.796  0.486
sml-p65          0.884  0.815  0.910  0.702       0.840  0.751   0.799  0.505
sml-p65-int8     0.864  0.830  0.911  0.699       0.847  0.751   0.803  0.499
sml-p70          0.873  0.848  0.919  0.716       0.833  0.784   0.822  0.552
sml-p70-int8     0.864  0.840  0.917  0.710       0.848  0.759   0.818  0.522
sml-p70-int8-m   0.873  0.827  0.912  0.702       0.842  0.757   0.810  0.497
trt-base         0.873  0.838  0.920  0.721       0.859  0.777*  0.837  0.575
trt-fp16         0.875  0.836  0.919  0.721       0.858  0.776   0.837  0.574
trt-int8         0.820  0.654  0.778  0.447       0.810  0.601   0.681  0.406
onnx             0.873  0.838  0.920  0.721       0.859  0.776*  0.837  0.575
* The base cases differ slightly, likely due to variations in number conversions across the different runtimes.

4.2 Inference time

Inference time is evaluated across various CPU hardware configurations and compared against the selected GPU hardware, the NVIDIA Jetson AGX Orin. These tests involve the SparseML models listed in Table 3.1 on the CPU side and the TensorRT models listed in Table 3.2 on the GPU side.

Figure 4.5: Comparison of the models' inference times across different hardware and core configurations. The SparseML models are evaluated on an AMD Genoa with 8-, 15-, and 30-core configurations, while the inference times for TensorRT and ONNX Runtime are evaluated on an NVIDIA Jetson AGX Orin.
Figure 4.5 shows the per-image inference time on AMD Genoa instances on Google Cloud, comparing the 8-core, 15-core, and 30-core configurations. The results show improvements in inference time for all optimizations applied, with the quantized models demonstrating the most significant improvement. The exact results are presented in Table 4.2. Overall, sml-p55-int8 yielded the best inference times for the 8-core setup, while sml-int8 performed best for both the 15-core and 30-core configurations. It is worth noting that the modified versions, which include a sparsified head, perform approximately the same as the versions without modifications.

The inference times of the TensorRT models on the GPU are also presented in Figure 4.5. Across all hardware setups, no model outperforms the trt-int8 model. However, running the sml-int8 model on the 30-core configuration resulted in an inference time that came exceptionally close (within 0.028 ms) to that of the trt-int8 model.

Table 4.2: Overview of the inference times in milliseconds across various hardware configurations, including AMD Genoa with 8-, 15-, and 30-core configurations, as well as NVIDIA Jetson AGX Orin (64 GB).

Model            8-core (ms)  15-core (ms)  30-core (ms)  GPU (ms)
sml-base         2.759        2.037         1.269         -
sml-int8         0.778        0.629         0.399         -
sml-p50          2.272        1.772         1.139         -
sml-p50-int8     0.892        0.728         0.508         -
sml-p55          2.204        1.733         1.106         -
sml-p55-int8     0.762        0.671         0.483         -
sml-p55-int8-m   0.765        0.664         0.480         -
sml-p65          1.845        1.471         0.933         -
sml-p65-int8     0.830        0.745         0.537         -
sml-p70          1.893        1.565         1.036         -
sml-p70-int8     0.984        0.894         0.651         -
sml-p70-int8-m   1.007        0.892         0.650         -
trt-base         -            -             -             0.714
trt-fp16         -            -             -             0.509
trt-int8         -            -             -             0.371
onnx             -            -             -             1.616

4.3 Combined evaluation

This section showcases the combined results of the experiments in order to put the different results into context. This is done using graphs of each model's relative predictive performance versus its speed-up. Due to its poor inference-time performance (see Figure 4.5), onnx was not included in these evaluations. For these plots, mAP50 was chosen as an approximation of the overall predictive performance, with inference time representing execution speed. This visualization is helpful because there is a trade-off associated with model compression: the increase in speed can come at the cost of predictive performance, and the relationship between the two is not necessarily linear. The results are visualized in Figure 4.6, with the CPU data based on the highest-end hardware (30 cores).

As can be seen in Figure 4.6, the models optimized using the modified recipes performed nearly the same as their unmodified counterparts. Another notable detail is that the SparseML models pruned without quantization were considerably slower than those using both pruning and quantization, which in turn suffer only a small degradation in mAP50. As expected, the un-optimized model running on the CPU was the slowest. In Figure 4.7, these data points are excluded as they are not competitive alternatives in terms of inference speed; instead, data from the different CPU hardware configurations are included to showcase the effect hardware has on performance.

In Figure 4.7, the 30-core CPU at 55% pruning can be seen outperforming the GPU with half precision in terms of inference speed, at a comparable level of predictive performance. Beyond half precision, the GPU sees a considerable drop in mAP50, and the same can be said for the purely pruned SparseML models.
Figure 4.6: Pose mAP at an OKS threshold of 50% plotted against inference time. The SparseML models are evaluated on an AMD Genoa 30-core processor, while the TensorRT models are evaluated on an NVIDIA Jetson AGX Orin.

Figure 4.7: Pose mAP at an OKS threshold of 50% plotted against inference time across all hardware configurations. This includes the TensorRT models, the quantized SparseML models, as well as the pruned and quantized SparseML models.

5 Discussion

This chapter discusses the findings of the results. It includes an evaluation of critical methodological and background aspects that could have impacted the results, along with an analysis of the results and suggestions for future work.

5.1 Method and background

Many factors can influence the effectiveness of compression techniques, and this could have had an impact on the results. Since a comprehensive study was not possible within the time constraints of the thesis, limitations had to be made. Creating a compression implementation of our own would have been interesting to explore, but was not feasible given the time horizon of the project and the small likelihood of achieving superior results. It would also have presented challenges for optimized inference: either our implementation would have had to be made compatible with DeepSparse, or we would have had to build an optimized runtime of our own, which in turn would have been limited by the same time constraints and low feasibility of superior results.

The adaptation of SparseML for use with pose models also proved more challenging than first expected, meaning that the scope of the evaluation had to be reduced considerably. These challenges were a result of the unexpectedly large amount of modifications needed to implement the functionality, in combination with a previously undiscovered bug associated with quantization that caused further delays. Without these hurdles, a more comprehensive comparison would have been feasible.

The limited selection of sparsification recipes could also have had a negative impact on our results, as could the fact that these recipes were originally intended for a different task, namely detection. Developing new recipes tailored to our model and task could potentially have led to better results, both in terms of predictive performance and inference time; the results may therefore not reflect the ideal case.

Other avenues could also have been explored to potentially improve the results. One of these was a wider range of sparsity levels, although based on the obtained results the impact would likely have been minimal. Additional pruning algorithms could also have been evaluated, which could have strengthened the reliability of the results.

The GPU implementation was comparatively painless, although it still required some work. While we did explore its built-in optimization methods, they did not yield notable results. There may, however, be unexplored methods that could have been included and evaluated for better results.

5.2 Key findings analysis

The results indicate that the TensorRT model quantized to 16-bit floating-point precision (trt-fp16) and the SparseML model with 55% sparsity quantized to int8 precision (sml-p55-int8) show the most promising results in terms of precision and latency, with sml-p55-int8 being slightly faster.
This demonstrates the potential for CPUs to outperform GPUs in certain configurations.

The case of sml-p55-int8 is particularly surprising, as it does not have the highest pruning level, which goes against the expectation that higher pruning rates result in faster inference. It is difficult to definitively determine the reason for this result, as many factors could be at play, including the compression techniques applied to the models, how DeepSparse leverages them, and how the resulting network structure interacts with the CPU. Our speculation is that a more extensive reduction in model size might decrease computational efficiency, with higher computational overhead arising from the increased sparsity. However, when examining the models pruned without quantization, we observe that higher levels of sparsity generally do result in reduced inference times, although the SparseML model with 65% sparsity (sml-p65) still outperforms the one with 70% sparsity (sml-p70). This suggests that there may be a threshold beyond which increased sparsity slows the model down, an effect that appears more pronounced when quantization is applied.

Another unexpected finding was that sml-p55-int8 was faster than the base model quantized to int8 precision (sml-int8) on the 8-core CPU setup. This implies that smaller CPU configurations might benefit more from combining quantization and pruning, and it underscores the importance of considering the selected hardware when determining the compression rates for the model.

Furthermore, our observations revealed a noteworthy trend: models with sparsified heads resulted in a slightly lower mAP50 than those without. This outcome may be due to the significantly smaller number of parameters in the head, implying a higher concentration of important weights whose removal has a detrimental impact on the average precision. The outcome could also have been different if the sparsification recipes had been optimized specifically for our model.

Finding a connection between predictive performance and sparsity level is challenging, as no clear trend can be seen. All models fluctuate around the baseline, with no model outperforming the others across the board. Furthermore, the variability in how pruning affects the resulting network structure, especially across different layers, contributes to the observed variation in model performance across sparsity levels.

A drop in predictive performance was observed on both hardware types for the models purely quantized to int8. Since it also occurs with TensorRT, it is likely not a result of the compression implementation, but rather a characteristic of our model, potentially due to a higher sensitivity inherent to the pose task itself.

Another interesting finding is that the full-precision TensorRT model (trt-base) is outperformed by sml-p55-int8 even on the 15-core CPU configuration. However, given the minimal difference in predictive performance between trt-base and trt-fp16, there appears to be little justification for not using the latter.

Based on our results, it is evident that neither NVIDIA nor Neural Magic outperforms the other in all cases of this specific scenario. This brings attention to a new consideration: cost, which could ultimately determine the choice.
It is worth noting that the cost of the CPU with a 32-core configuration is approximately equivalent to the price of the NVIDIA Jetson Orin hardware. This makes the 32-core CPU the more expensive choice, since the rest of the required components are not included in its price. If fast inference is not a critical necessity, this could open the door to more affordable alternatives where a lower core count can be used. Otherwise, it may be difficult to justify the process of using SparseML and DeepSparse over simply converting the model to TensorRT, particularly in terms of simplicity.

5.3 Limitations

Due to the limited scope of this thesis, the generality of the results has to be considered. With the experiments being limited to a single model for a single use case, very little can be said about the performance or behavior in other domains. There is no guarantee that the findings carry over to other network structures, since the results could be unique to our situation. However, the results will hopefully generalize to CNNs within latency-critical domains, which cover a wide range of applications.

The reliability of the evaluation data itself can also be called into question, since the results could be heavily tied to the testing conditions of the experiments, with some surrounding factors not being considered. Thermals and power consumption, for example, were hard to account for due to the limited control over the testing setup running in the cloud. The performance results may therefore not realistically reflect the real-world performance of the model on the given hardware.

Another limiting factor is the dataset needed during pruning. The dataset used can considerably influence the effectiveness of pruning. In situations without an available dataset, the model would be restricted to one-shot pruning, which could lead to completely different model behaviour, making our results unrepresentative in that scenario.

5.4 Ethical considerations

There are ethical aspects to consider regarding the use and deployment of compressed models, specifically in the case of pruning. Since the effects of pruning are not yet fully understood, its use could alter the model's behaviour in unforeseen ways, resulting in some degree of uncertainty. This can make compressed models unsuitable for mission-critical applications.

In applications like healthcare and automotive, where the consequences can be catastrophic, the feasibility of sacrificing average precision and certainty for the sake of performance should be questioned. At the same time, inference that is too slow can be a risk in itself; in some cases, like self-driving cars, slow processing could prove fatal. In those cases, investing in more capable hardware might be more appropriate, although it depends on the situation. It should, however, be noted that pruning does not always result in worse predictive performance and can, in some instances, even improve it.

5.5 Future work

In terms of future work, further investigation into optimizing sparsification recipes could yield valuable insights. Exploring new approaches to designing and fine-tuning these recipes could enhance the efficiency and effectiveness of model optimization techniques.
Additionally, evaluating the performance of sparsification recipes across a broader range of hardware configurations and machine learning tasks could provide a more comprehensive understanding of potential use cases and limitations.

6 Conclusion

This thesis set out to explore whether CPUs could be a competitive alternative to modern GPU solutions on a real-time, latency-critical vision task. The vagueness surrounding many of the publicly available performance results in this domain, with many being incomprehensive and unsuitable for cross-hardware comparison, was one of the main reasons for this exploration. Despite the limited scope of this work, the findings do suggest that CPUs can be a competitive alternative for low-latency machine vision tasks, even at a comparable level of predictive performance. Therefore, when the situation allows for it, CPUs are worth considering, although in which cases they are preferable remains inconclusive.

References

[1] T. B. Brown, B. Mann, N. Ryder, et al., Language models are few-shot learners, 2020. arXiv: 2005.14165 [cs.CL].
[2] NVIDIA Corporation, "NVIDIA announces financial results for fourth quarter and fiscal 2024," Feb. 2024. [Online]. Available: https://investor.nvidia.com/news/press-release-details/2024/NVIDIA-Announces-Financial-Results-for-Fourth-Quarter-and-Fiscal-2024/ (accessed May 31, 2024).
[3] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, "Edge computing: Vision and challenges," IEEE Internet of Things Journal, vol. 3, no. 5, pp. 637–646, 2016. doi: 10.1109/JIOT.2016.2579198.
[4] Y. LeCun, J. Denker, and S. Solla, "Optimal brain damage," in Advances in Neural Information Processing Systems, vol. 2, Jan. 1989, pp. 598–605.
[5] M. Zhu and S. Gupta, To prune, or not to prune: Exploring the efficacy of pruning for model compression, 2017. arXiv: 1710.01878 [stat.ML].
[6] S. Chen and Q. Zhao, "Shallowing deep networks: Layer-wise pruning based on feature representations," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 12, pp. 3048–3056, 2019. doi: 10.1109/TPAMI.2018.2874634.
[7] B. Liu, Y. Cai, Y. Guo, and X. Chen, TransTailor: Pruning the pre-trained model for improved transfer learning, 2021. arXiv: 2103.01542 [cs.CV].
[8] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz, Pruning convolutional neural networks for resource efficient inference, 2017. arXiv: 1611.06440 [cs.LG].
[9] Neural Magic, "Part 1: What is pruning in machine learning?" 2020. [Online]. Available: https://web.archive.org/web/20230130051206/https://neuralmagic.com/blog/pruning-overview/ (visited on 04/05/2024).
[10] Neural Magic, SparseML: A library for sparse model training and optimization, version 1.6.1. [Online]. Available: https://github.com/neuralmagic/sparseml (accessed February 1, 2024).
[11] Google Cloud, "Introduction to Cloud TPU," 2024. [Online]. Available: https://cloud.google.com/tpu/docs/intro-to-tpu.
[12] Y. E. Wang, G.-Y. Wei, and D. Brooks, Benchmarking TPU, GPU, and CPU platforms for deep learning, 2019. arXiv: 1907.10701 [cs.LG].
[13] NVIDIA Corporation, "NVIDIA Tensor Cores." [Online]. Available: https://www.nvidia.com/en-us/data-center/tensor-cores/ (accessed April 18, 2024).
[14] NVIDIA Corporation, TensorRT open source software, version 8.6. [Online]. Available: https://github.com/NVIDIA/TensorRT (accessed February 1, 2024).
[15] Neural Magic, YOLOv8 Detection 10x Faster with DeepSparse: 500 FPS on a CPU, 2023. [Online]. Available: https://neuralmagic.com/blog/yolov8-detection-10x-faster-with-deepsparse-500-fps-on-a-cpu/ (accessed May 3, 2024).
[16] NVIDIA Corporation, NVIDIA TensorRT, 2024. [Online]. Available: https://developer.nvidia.com/tensorrt?spm=a2c6h.13046898.publish-article.26.7fe06ffaBIjVNA (accessed May 3, 2024).
[17] NVIDIA Corporation, NVIDIA A100 Tensor Core GPU, 2024. [Online]. Available: https://www.nvidia.com/en-us/data-center/a100/ (accessed 2024-06-07).
[18] Neural Magic, Fastest Software-Delivered AI on CPUs - DeepSparse, 2023. [Online]. Available: https://neuralmagic.com/deepsparse/ (accessed 01-02-2024).
[19] J. Pool, A. Sawarkar, and J. Rodge, Accelerating inference with sparsity using the NVIDIA Ampere architecture and NVIDIA TensorRT, 2021. [Online]. Available: https://developer.nvidia.com/blog/accelerating-inference-with-sparsity-using-ampere-and-tensorrt/ (accessed 2024-06-05).
[20] B. Mehlig, Machine Learning with Neural Networks: An Introduction for Scientists and Engineers. Cambridge University Press, Oct. 2021, isbn: 9781108494939. doi: 10.1017/9781108860604.
[21] E. Fladmark, M. H. Sajjad, and L. B. Justesen, Exploring the performance of pruning methods in neural networks: An empirical study of the lottery ticket hypothesis, 2023. arXiv: 2303.15479 [cs.LG].
[22] M. Augasta and T. Kathirvalavakumar, Open Computer Science, vol. 3, no. 3, pp. 105–115, 2013. doi: 10.2478/s13537-013-0109-x.
[23] B. Hassibi and D. Stork, "Second order derivatives for network pruning: Optimal brain surgeon," in Advances in Neural Information Processing Systems, S. Hanson, J. Cowan, and C. Giles, Eds., vol. 5, Morgan-Kaufmann, 1992. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/1992/file/303ed4c69846ab36c2904d3ba8573050-Paper.pdf.
[24] Z. Yang and H. Zhang, "Comparative analysis of structured pruning and unstructured pruning," in Frontier Computing, J. C. Hung, N. Y. Yen, and J.-W. Chang, Eds., Singapore: Springer Nature Singapore, 2022, pp. 882–889, isbn: 978-981-16-8052-6.
[25] Y. He and L. Xiao, "Structured pruning for deep convolutional neural networks: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 5, pp. 2900–2919, May 2024, issn: 1939-3539. doi: 10.1109/TPAMI.2023.3334614.
[26] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, Pruning filters for efficient convnets, 2017. arXiv: 1608.08710 [cs.CV].
[27] J. Frankle and M. Carbin, The lottery ticket hypothesis: Finding sparse, trainable neural networks, 2019. arXiv: 1803.03635 [cs.LG].
[28] I. Lazarevich, A. Kozlov, and N. Malinin, Post-training deep neural network pruning via layer-wise calibration, 2021. arXiv: 2104.15023 [cs.CV].
[29] S. Han, H. Mao, and W. J. Dally, Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding, 2016. arXiv: 1510.00149 [cs.CV].
[30] B. Jacob, S. Kligys, B. Chen, et al., Quantization and training of neural networks for efficient integer-arithmetic-only inference, 2017. arXiv: 1712.05877 [cs.LG].
[31] Neural Magic, SparseZoo: Neural network model repository for highly sparse and sparse-quantized models with matching sparsification recipes, version 1.7.0. [Online]. Available: https://github.com/neuralmagic/sparsezoo (accessed April 5, 2024).
[32] NVIDIA Corporation, NVIDIA V100 Tensor Core GPU datasheet, 2020. [Online]. Available: https://images.nvidia.com/content/technologies/volta/pdf/volta-v100-datasheet-update-us-1165301-r5.pdf.
[33] NVIDIA Corporation, Convolutional layers user's guide, 2024. [Online]. Available: https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html (accessed 2024-06-11).
[34] Y. Zhou and K. Yang, "Exploring TensorRT to improve real-time inference for deep learning," in 2022 IEEE 24th Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys), 2022, pp. 2011–2018. doi: 10.1109/HPCC-DSS-SmartCity-DependSys57074.2022.00299.
[35] ONNX, ONNX: About, 2024. [Online]. Available: https://onnx.ai/about.html (accessed 2024-06-05).
[36] T.-Y. Lin, M. Maire, S. Belongie, et al., Microsoft COCO: Common objects in context, 2015. arXiv: 1405.0312 [cs.CV].
[37] F. Chamaki, Woman selecting packed food on gondola, 2021. [Online]. Available: https://unsplash.com/photos/woman-selecting-packed-food-on-gondola-YNaSz-E7Qss (accessed 5 June 2024).
[38] G. Jocher, A. Chaurasia, and J. Qiu, Ultralytics YOLO, version 8.0.0, Jan. 2023. [Online]. Available: https://github.com/ultralytics/ultralytics.
[39] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," CoRR, vol. abs/1506.02640, 2015. arXiv: 1506.02640. [Online]. Available: http://arxiv.org/abs/1506.02640.
[40] J. Doe, Four-person walking near vehicle, 2021. [Online]. Available: https://unsplash.com/photos/four-person-walking-near-vehicle-H-j7KH1gjP4 (accessed 5 June 2024).
[41] Neural Magic, Inc., DeepSparse hardware support, 2024. [Online]. Available: https://github.com/neuralmagic/deepsparse/blob/main/docs/user-guide/hardware-support.md (accessed 2024-05-31).
[42] NVIDIA Corporation, TensorRT: NVIDIA's Deep Learning Inference Toolkit, 2024. [Online]. Available: https://github.com/NVIDIA/TensorRT (accessed March 2024).
[43] N. Yorav-Raphael and C. da Costa-Luis, tqdm: A fast, extensible progress bar for Python and CLI, version 4.65.0, 2024. [Online]. Available: https://github.com/tqdm/tqdm.