Hardware Acceleration of Machine Learning
Evaluation and comparison of different hardware-aware optimization techniques
Master's thesis in Computer Science and Engineering

Fangzhou Chen
William Sköld

Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
Gothenburg, Sweden 2023

© Fangzhou Chen, William Sköld, 2023.

Supervisor: Pedro Petersen Moura Trancoso, Department of Computer Science and Engineering
Advisor: Evangelos Siminos, Volvo Group (SML)
Examiner: Pedro Petersen Moura Trancoso, Department of Computer Science and Engineering

Master's Thesis 2023
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: A running megatron, generated by DALLE2.
Typeset in LaTeX
Gothenburg, Sweden 2023

Abstract

The Transformer architecture has been widely used in various fields, as demonstrated by GPT-3, a large language model that shows impressive performance. However, achieving such excellent performance requires high computational capabilities. Therefore, improving the computational power of current machine learning systems is of great importance. This thesis aims to optimize and accelerate fine-tuning of Transformer-based models while taking into account several evaluation criteria, such as training time, energy consumption, cost, and hardware utilization. Additionally, a comparison is made between GPU training settings and specialized AI accelerators, such as TPU training settings.

In our study, a high-performance kernel for the Adan optimizer is introduced, and the LightSeq library is applied to accelerate the existing Transformer components. We also introduce mixed precision training into our workflow and compare all these optimization techniques step by step against the baseline performance. In addition, our analysis includes distributed training with multiple GPUs, and a backpropagation time estimation algorithm is introduced. Next, Google's TPU accelerator is used to run our task, and its performance is compared to a similar GPU setup used in our study.

Finally, the advantages and disadvantages of the different methods are systematically analyzed, while training on V100, A100, A10 and T4 GPUs with different configurations. Meanwhile, the workflow on GPUs and TPUs is analyzed, illustrating the pros and cons of the different accelerators. Various weights for scoring the optimization methods based on time, energy consumption, cost, and hardware utilization are proposed. Our analysis shows that optimal scores in all metrics can be achieved by implementing the optimized LightSeq model, kernel fusion for the Adan optimizer, and enabling mixed precision training.
While training with a TPU offers certain advantages, such as large batch sizes when loading training data, the ease of use, reliability, and software stability of GPU training surpass those of TPU training.

Keywords: Transformer, GPU, Distributed, Energy consumption, Fine-tuning, TPU.

Acknowledgements

We would like to express our sincere gratitude to our supervisors, Evangelos and Pedro, for their guidance, expertise and unwavering support throughout this thesis. Their mentorship, advice and constructive feedback have been instrumental in shaping the direction and results of this thesis. We would also like to thank Thomas Nordenskjöld at Volvo for his support and help with all the small things related to collaborating with Volvo for this thesis. Lastly, we would like to extend our thanks to Volvo Group SML for providing us with the necessary tools for this thesis, such as Microsoft Azure for benchmarking, expertise from employees, and a workspace.

William and Fangzhou, Gothenburg, 2023-06-13

Contents

List of Figures
List of Tables
1 Introduction
1.1 Machine Learning Hardware
1.2 Machine Learning Framework
1.2.1 Machine Learning Libraries
1.3 Transformer usage at Volvo
1.4 Problem statement
1.5 Related Work
1.6 Aim and Objectives
1.7 Delimitations
1.8 Thesis Outline
2 Theory
2.1 Transformer
2.1.1 Self Attention
2.1.2 Multi-headed Attention
2.1.3 Transformer Architecture
2.1.4 BERT
2.1.5 Pre-train and Fine-tuning
2.2 Deep Learning Optimizer
2.2.1 Stochastic Gradient Descent
2.2.2 Momentum
2.2.3 Adam
2.2.4 Adan
2.3 Hardware Architecture
2.3.1 GPU
2.3.2 TPU
2.4 Mixed Precision Training
2.5 LightSeq Library
2.6 Distributed Training
2.6.1 Topology and Communication
2.6.2 Data Parallel
2.6.3 Workflow
2.6.4 Collective Communication
3 Methods
3.1 Fused Adan
3.1.1 Profiling
3.1.2 Kernel fusion
3.1.3 Memory Hierarchy
3.1.4 Vector Instruction
3.1.5 Instruction Level Parallelism
3.2 Multi Tensor Access
3.2.1 Profiling
3.2.2 Multi Tensor Apply
3.3 Mixed Precision Training
3.3.1 Enable Mixed Precision Training
3.4 LightSeq Training
3.5 Distributed Training
3.5.1 Enabling Data Parallel
3.5.2 Backpropagation Time Estimation
3.6 Training on a TPU
3.6.1 Training Frameworks
3.7 Systematic Benchmark
3.7.1 Hardware examined
4 Results
4.1 Experimental Setup
4.1.1 Models
4.1.2 Platform
4.1.3 Dataset
4.1.4 Training Settings
4.1.5 Metrics
4.2 Fused Adan
4.2.1 Benchmark on Training process
4.2.2 Benchmark on Optimization process
4.2.3 Benchmark on Adan fused kernel
4.3 Mixed Precision Training
4.4 LightSeq Training
4.5 Distributed Training
4.5.1 Data Parallel
4.5.2 Time Consumption in One Training Step
4.5.3 Backpropagation Time Estimation and Measurement
4.6 TPU-based training
4.7 Overall Evaluation
4.7.1 TPU-GPU evaluation
5 Discussion
5.1 Comparison between different optimizations
5.1.1 Recommendation
5.2 Volvo Dataset
5.2.1 Benchmark results
5.2.2 Switch to New Datasets
5.3 Kernel Fusion
5.4 Software and Hardware Improvements for ML
5.4.1 Software Improvements
5.4.2 Different Hardware
5.4.3 Mixed Precision on GPUs
5.5 Randomness in Test Accuracy
5.6 TPU
5.6.1 Framework
5.6.2 TPU-GPU
5.6.3 Unfair Batch Sizes
6 Future work
6.1 Gradient Compression
6.2 Dcompute and Dcomm Estimation
6.3 Performance Improvement on Fused Adan
6.4 System Level Energy Measurement
6.5 TPU energy measurements
7 Conclusion
Bibliography
A Appendix 1
A.1 Fused Adan
A.2 Benchmark Framework
A.3 Overall Evaluation Original Data
A.4 LightSeq-Hugging Face Guide

List of Figures

2.1 Example of correlation between query vector and key vector.
2.2 Multi Attention with three attention heads; vector a1 is the word embedding and b1 head_i is the attention vector generated by the different attention heads; these vectors are transformed by the linear layer.
2.3 Transformer Architecture; on the left is the stack of encoders, on the right is the stack of decoders.
2.4 BERTbase for sequence classification.
2.5 GPU architecture example.
2.6 Principle of systolic array.
2.7 MXU architecture.
2.8 Mixed precision training weight update.
2.9 BF16, FP16 and FP32 formats. Different bit ranges for the above three floating-point representations.
2.10 Distributed training topology; on the left is the parameter server, on the right is the ring organization.
2.11 Data Parallel.
2.12 Broadcast communication.
2.13 Allreduce communication.
3.1 Unfused Adan kernel launch time graph with single tensor access.
3.2 Multi Tensor Access, with a chunk size of 3, meaning that the tensors (red, yellow and green) are split into chunks of 3 for more efficient memory accesses.
3.3 1-Step benchmark displaying the GPU's SM and memory throughput when using Single Tensor Access.
3.4 Profiling showing the most time consuming GPU kernels during one training step when using the Hugging Face model without mixed precision.
3.5 Backpropagation workflow in distributed training.
4.1 Benchmark 1, Loss and accuracy for unfused and fused Adan.
4.2 Benchmark 2, Fused Adan kernel with single tensor access launch time graph, showing that kernel launches are very loosely distributed over the timeline.
4.3 Benchmark 2, Fused Adan kernel with multi tensor access launch time graph, showing that kernel launches are compactly distributed over the timeline.
4.4 Benchmark 3, GPU pipeline utilization.
4.5 SM and memory throughput of the GPU when using Multi Tensor Access in the one-step benchmark.
4.6 Benchmark 4, Mixed precision training loss and accuracy.
4.7 Percentages of the top 10 GPU kernels with the highest time consumption during one training step using the Hugging Face model with mixed precision.
4.8 Hardware utilization when training the Hugging Face model using mixed precision in the one-step benchmark.
4.9 Hardware utilization in the one-step benchmark when training the LightSeq model using mixed precision.
4.10 Benchmark 5, FP16 training loss and accuracy.
4.11 Benchmark 6, Data Parallel training loss and accuracy.
4.12 Four plots (A, B, C and D) showing how the different hardware compares with the optimization techniques previously shown. Each point is one configuration. Because TPU energy consumption is unavailable in Google Cloud, the TPU is only included in plots that do not plot energy on any axis.
5.1 LightSeq, HuggingFace on V100 and HuggingFace on A100 with batch sizes of 16 and 32.
5.2 Best training time and cost on different devices. Training configuration: LightSeq Encoder (GPUs), Fused Adan (GPUs) and AdamW (TPU), mixed precision training. Batch size 32 on A10, A100 and V100; batch size 16 on T4; batch size 512 on TPUv2 and TPUv3.
5.3 Training time and energy when enabling and disabling mixed precision on GPUs.

List of Tables

3.1 Typical kernel size with their SM/Memory throughput.
3.2 Average execution time of sgemm kernels with different sizes when not using mixed precision.
3.3 Average execution time of common kernels within the Hugging Face model.
3.4 Notation used for backpropagation time estimation.
3.5 Parameters used for the different hardware in the systematic benchmark.
3.6 Hardware specifications of the selected hardware. Important to note is that Clock Frequency, when referring to GPUs, is the CUDA core clock frequency.
4.1 Benchmark settings; f_m means fused multi tensor, uf_s means unfused single tensor, etc.
4.2 Benchmark 1, Comparison between different optimizers.
4.3 Benchmark 2, One-step benchmark time results.
4.4 Benchmark 2, Different Adan kernel one-step benchmark results.
4.5 Active time ratio for different Adan kernel implementations.
4.6 Benchmark 3, Typical kernel size and throughput for different Adan kernel implementations.
4.7 Benchmark 4, Mixed precision training result.
4.8 Benchmark 5, Training using Hugging Face and LightSeq.
4.9 Benchmark 6, Data parallel training result.
4.10 One-step time consumption per GPU.
4.11 Layer time measurement with 2 GPUs.
4.12 Layer time measurement with 4 GPUs.
4.13 Benchmark 7, Baseline and mixed precision performance for the Python XLA and TensorFlow implementations. Fine-tuning for 3 epochs with a batch size of 16.
4.14 Benchmark 7, Mixed precision performance with different batch sizes for the two implementations. Out-of-memory occurred for the XLA implementation at a batch size of 32.
4.15 Parameters for the systematic benchmark.
4.16 Results based on fine-tuning the BERT model using the IMDb dataset. The first section shows the baseline performance for the different architectures with a batch size of 16 (16x8 for TPUv2/v3). The second section shows the mixed precision performance of the different hardware with a batch size of 16 (16x8 for TPUv2/v3). The third section features the most optimized implementation for each type of hardware.
5.1 Original data from the different optimizations.
5.2 Normalized data from the different optimizations.
5.3 Weight table.
5.4 Score table; the lower the better.
5.5 Performance on the Volvo dataset.

1 Introduction

Machine learning (ML) has become increasingly popular in recent years as a powerful tool for solving complex tasks in natural language processing (NLP), computer vision, automation, and other fields. The emergence of impressive ML technologies such as ChatGPT [1] and DALLE [2], which showcased remarkable capabilities in early 2023, is due to developments in deep learning, a branch of ML. These advancements are based on the Transformer architecture proposed by Google [3].

The Transformer-based BERT model [4] is extensively utilized in the field of NLP, and efficiency gains can be observed across a wide range of other industries and companies. As outlined in a blog post by Google, BERT is used to enhance the interpretation of user queries [5] and improve the user experience. Examples such as these show that even for domain-specific tasks, using Transformer-based models can improve efficiency and performance. Some AI-based translation tools, such as DeepL, also use Transformer-based models. By customizing the network architecture to a certain extent, translation accuracy and speed between different languages are greatly improved, while maintaining low usage costs.

The success of the Transformer architecture can be attributed to its ability to address the data dependency problem that arises when training the recurrent neural networks (RNNs) commonly used in traditional NLP, which accelerates the Transformer-based model's training through parallelization [6].
In the training of an RNN, the input of the next iteration depends on the output of the previous round, making it difficult to feed multiple tokens of a sequence into the network simultaneously for parallel training. The Transformer eliminates this dependency by replacing the recurrent association between tokens with the self-attention mechanism.

However, the Transformer architecture is not without its flaws, one of which is its lower computational efficiency. Specifically, if the input token sequence has a length of $n$ and each token embedding has dimension $d$, the overall complexity per layer of the self-attention mechanism is $O(n^2 d)$. In comparison, the complexity per layer of a CNN in the same case is $O(k n d^2)$ (where $k$ is the number of convolution kernels), and that of an RNN is $O(n d^2)$. This shows that the complexity of the Transformer increases quadratically as the sequence length grows linearly. Therefore, it is essential to improve the compute capabilities for the Transformer architecture from both the hardware side and the software side, especially when dealing with longer sequences.

1.1 Machine Learning Hardware

There are several companies offering various hardware prototypes that can accelerate the training of machine learning models. NVIDIA's GPUs are the most commonly used for accelerating CUDA-based applications like machine learning, thanks to their high performance and built-in CUDA cores. Other companies like Intel and AMD also offer their own selection of GPUs suitable for accelerating deep learning. For example, Intel's latest ARC series GPUs are expected to perform similarly to NVIDIA's RTX series. Intel has implemented an abstraction of its latest GPU offering through its XPU heterogeneous computing platform, enabling it to run on mainstream machine learning frameworks like PyTorch. Similarly, AMD offers the Radeon family of GPUs suitable for ML using its open-source software called ROCm.

In addition to GPUs, other types of hardware can accelerate machine learning, including FPGAs or specialized chips designed specifically for ML. For example, the Cerebras Wafer-Scale Engine (WSE) chip is a giant AI accelerator that utilizes a single wafer. It allows for reduced communication latency between cores and more shared on-chip memory, resulting in highly accelerated training and ultra-low latency inference [7]. Another example is Google's Tensor Processing Unit (TPU), which will be covered in more detail later on.

1.2 Machine Learning Framework

Data scientists usually use a comprehensive deep learning framework to aid them in creating deep learning models. The most common frameworks in academia and industry are PyTorch [8] and TensorFlow [9]. The primary component of these frameworks is a data structure known as a Tensor. Torch provided exceptional support for GPUs, but it only offered an upper-level interface in the Lua programming language, which is not widely used. As a result, Facebook's development team reimplemented the upper-level interface in Python to provide full Python support, and the result is now named PyTorch. PyTorch has gained popularity because of the widespread use of Python in the community.

TensorFlow is a deep learning framework introduced by Google in 2015. It is comparable to PyTorch since operations are performed on tensors. However, unlike PyTorch, TensorFlow is a static graph-based framework, whereas PyTorch is based on dynamic graphs. Both dynamic and static graphs have their advantages and disadvantages. Dynamic graphs are simple to use for model development and error detection; nonetheless, they often have lower performance and cannot meet specific needs. Static graphs are quicker since they allow a series of optimizations to be made before execution. However, the optimized graphs lead to errors that are harder to detect at runtime, making debugging and error solving more difficult. The difference is illustrated by the sketch below.
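As a minimal illustration (not taken from the thesis code), the following sketch contrasts the two execution models: PyTorch executes operations eagerly as ordinary Python, while TensorFlow's tf.function traces a Python function into a graph that can be optimized before execution. The toy tensors and functions are hypothetical.

```python
import torch
import tensorflow as tf

# Dynamic graph (PyTorch): operations run eagerly, so ordinary Python control
# flow, print statements and debuggers can inspect intermediate results.
def torch_step(x):
    y = x * 2
    if y.sum() > 0:        # data-dependent branch resolved immediately
        y = y + 1
    return y

print(torch_step(torch.tensor([1.0, -3.0])))

# Static graph (TensorFlow): tf.function traces the computation once into a
# graph that can be optimized (e.g. fused or pruned) before execution, but
# runtime errors surface inside the compiled graph and are harder to debug.
@tf.function
def tf_step(x):
    return tf.cond(tf.reduce_sum(x) > 0, lambda: x * 2 + 1, lambda: x * 2)

print(tf_step(tf.constant([1.0, -3.0])))
```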
1.2.1 Machine Learning Libraries

In addition to hardware, high-performance software libraries are also important in achieving maximum computational performance. Different hardware requires different libraries to fully exploit its potential performance. For example, NVIDIA provides the cuDNN high-performance machine learning software library, while Intel offers the oneDNN machine learning library. AMD provides the ROCm library for their hardware to maximize performance in machine learning tasks. These libraries use a variety of hardware optimization strategies to address performance bottlenecks that compilers cannot optimize. For instance, the double buffering technique proposed by Tian et al. [10] takes advantage of the fast warp switching feature of NVIDIA GPUs. This technique hides the memory access latency of one warp by prefetching data for another warp, allowing the scheduler to quickly switch the other warp onto the idle stream processor when a memory access occurs on the first warp. However, the increasing diversity of hardware in recent years has presented a significant challenge in the development of software libraries. While specialized hardware designs can significantly improve performance in a given domain, non-specialists may struggle to accelerate their tasks by writing high-performance kernels independently, as these efforts are usually limited to hardware vendors.

1.3 Transformer usage at Volvo

In international companies like Volvo Group, many teams are trying to use artificial intelligence technology to expedite their business and reduce costs. The implementation of Transformer-based neural network architectures is being pursued by many Volvo data scientists to build their own large language models and train customized models on the internal corpus. This has sparked Volvo's interest in speeding up the training of Transformer models, which would allow AI technology to be used in Volvo's various business activities by reducing the cost of training large language models.

At Volvo SML, BERT-based models are used in various tasks, such as predicting traffic codes by generating the corresponding codes based on parts' descriptions. This reduces the need for human labor and allows human effort to be focused on more critical steps. Our goal is to optimize the Transformer architecture to reduce the hardware usage and time costs associated with the model development workflow.

1.4 Problem statement

The push for training large models among academic researchers and industry engineers has increased, and pre-trained models based on the Transformer architecture, such as BERT, are already prevalent in NLP-based tasks. However, they still require fine-tuning on domain-specific datasets before they can effectively be used in research or products. The fine-tuning process incurs not only an initial cost but also a recurring cost in terms of money, time and energy, due to factors such as wanting to improve the model in the future with more data, or simply model degradation, in which the model's performance worsens with time.
This thesis therefore aims to investigate the effectiveness of various hardware-aware optimization techniques, such as kernel fusion, mixed precision, acceleration libraries and distributed training techniques, when fine-tuning a pre-trained BERT model. The results will address the following research questions:

1. How can the fine-tuning process of the BERT model benefit from hardware-aware optimization techniques in terms of time, energy consumption and monetary cost?
2. What are the benefits and drawbacks of using GPU-based training methods as opposed to a TPU-based training method?

1.5 Related Work

Sparsh et al. [11] present a survey focusing on architecture and system-level optimizations on GPUs for DL applications, reviewing techniques for inference and training as well as for using either a single GPU or a distributed system. They highlight that over the past nine years, GPU compute power has increased 32x whilst memory bandwidth has only increased 13x, while DNN memory requirements grow at a rapid pace. They thus conclude that substantial improvements in memory capacity and bandwidth are required to keep GPUs the main platform for DNNs. Most papers focused only on a single technique for increasing performance; future work should therefore use multiple techniques in conjunction and evaluate them.

Furthermore, they state that a critical challenge in GPU use is its large power consumption, and that nearly all surveyed papers ignored the energy efficiency assessment of their solutions, which future work should evaluate, along with newer GPU models and their low/mixed-precision techniques [12] for achieving higher efficiency. Moreover, they say it will be interesting to see how next-generation GPUs compare to custom-made AI accelerators such as TPUs. Finally, they focus on the lack of security when using GPUs, arguing that security needs to be a design parameter when designing GPUs and CNNs, not retrofitted.

Shi et al. [13] present a comparative study of SOTA GPU-accelerated DL software tools (Caffe, TensorFlow, Torch, etc.), benchmarking tool performance using three popular neural networks on two CPU and three GPU platforms. They also benchmark on a distributed system with multiple GPUs. Two contributions are made, one for end-users and one for software developers: first, their results can be used by end-users as a guide for selecting hardware and software; secondly, they show software developers of DL tools possible future directions for performance optimization. They conclude by showing that the tools tested make good use of GPUs and achieve significant speed-ups over their CPU counterparts. However, no single software tool evaluated can consistently perform better than the others, so there should be opportunities for optimization in this area. Finally, they state two directions for future work: one is benchmarking DL software tools and hardware platforms, and the other is evaluating tool scalability on a high-performance GPU cluster.

Sze et al. [14] discuss the challenges of energy consumption, throughput, and data movement and how to address them at various levels of hardware design. They state that data movement is the cause of high energy consumption, and that recent research focuses on reducing memory movement while maintaining accuracy, throughput, and cost. They therefore advocate selecting architectures with efficient memory hierarchies and data flows that increase data reuse.
They also mention that joint design between algorithms and hardware has been a driving force in reducing data movement requirements through reduced bit-width precision, compression, and increased sparsity. They conclude that the interactions between different levels of hardware design should be considered; one example is reducing the cost of memory access, which could lead to more energy-efficient data flows.

Wang et al. [15] conduct a study on the performance and energy efficiency of multiple hardware types (Intel CPU, NVIDIA GPU, AMD GPU, Google TPU) on a suite of representative DL workloads. They investigate the impact on performance and energy consumption of many factors, such as the hardware, the software library for the hardware, and the deep learning framework. Their contribution is twofold: providing end users with useful results when selecting hardware for their own projects, and exposing opportunities for hardware manufacturers to improve their software.

1.6 Aim and Objectives

The main aim is to analyze how hardware-aware optimization techniques such as kernel fusion, multi-tensor access, etc. impact the fine-tuning process of the BERT model. Secondarily, we also aim to investigate the performance differences between various hardware within two main architectures, GPUs and TPUs, as well as compare these architectures directly. Based on these aims, the following goals are formulated:

1. Accelerating the Adan optimizer using kernel fusion and multi-tensor access techniques. Different hardware-related optimization techniques will be explored to determine the best implementation. The advantages of the high-performance kernel in terms of computation time and hardware utilization will be evaluated.
2. Accelerating the existing training process (Hugging Face baseline) using techniques such as mixed precision and distributed training, and also using acceleration libraries such as LightSeq. The optimization process will be conducted in stages and compared with the existing workflow, mainly evaluating the training time, energy consumption, and cost.
3. The existing workflow will be migrated to a TPU-based training platform and compared with the GPU-based training platform, both in terms of baseline performance and optimized performance. The advantages and disadvantages of both methods will be analyzed.
4. Lastly, performing a systematic benchmark / grid search to evaluate a plethora of hardware, and how the hardware is impacted by the optimization techniques used in this thesis.

1.7 Delimitations

This thesis has one main limitation: access to physical hardware. Because of this, the thesis will be conducted in the cloud only, that is, on either Microsoft Azure or Google Cloud. The advantage of this is access to a wide selection of hardware that can be benchmarked, although at the disadvantage of having to use energy estimations rather than real energy measurements.

However, even though a wide choice of hardware is available in the commercial market, our thesis is limited to the use of the NVIDIA GPUs available in Volvo's subscription on Azure and Google TPUs (v2 and v3). The TPUv4 would perhaps be a more fitting candidate for some of the benchmarks, but is unavailable due to monetary reasons (Google Cloud offers TPUv2 and TPUv3 for free via their TPU Research Cloud initiative). Moreover, while pre-training would be an interesting area of research, the costs associated with repeating pre-training using cloud computing are considered too high.
1.8 Thesis Outline

Theory: In this chapter, we provide a brief overview of relevant background literature, including the Transformer and its key algorithms, optimization algorithms, GPGPU and TPU hardware architectures, the LightSeq library, mixed precision training, and distributed training.

Method: This chapter outlines the current problems in the training process and presents different optimization methods for each problem to improve training performance. We also discuss the potential benefits of these optimization methods.

Results: In this chapter, we present the results of our approach and briefly describe them.

Discussion: This chapter synthesizes the results from the previous chapter, summarizes our findings, and provides recommendations. We also delve deeper into the optimization methods and training processes used to provide insights useful for future work.

Future work: In this chapter, we suggest future work based on our findings as well as on the limitations of the thesis.

Conclusion: This chapter summarizes the thesis content and presents our conclusion based on the presented results.

2 Theory

This chapter serves as an introduction to the underlying theory that forms the basis of this thesis. It covers several key topics, including the Transformer architecture, deep learning optimizer theory, GPU and TPU architectures, the LightSeq software library, mixed precision training, and the concept of distributed training. By providing an overview of these theories, the reader will gain a better understanding of the subsequent chapters of the thesis and how these theories relate to the research presented here.

2.1 Transformer

The Transformer is a popular machine learning architecture used for various tasks in natural language processing. These tasks include sequence labeling, sequence classification, natural language generation, word sense disambiguation, etc. The Transformer architecture, with its superior semantic feature extraction and extended sequence processing capabilities, has enabled Transformer-based models to achieve state-of-the-art (SOTA) performance on numerous datasets. ERNIE-Doc-Large [16] achieved SOTA performance on the IMDB dataset text classification task with 97.1% accuracy. On the Penn Treebank dataset [17], GPT-3 (zero-shot) achieved the SOTA test perplexity on language modelling of 20.5 [18]. This thesis focuses on sequence classification tasks using the Transformer-based BERT model. The Transformer architecture introduces a range of novel features and capabilities, which will be illustrated in the following sub-sections.

2.1.1 Self Attention

Self-attention is a technique that allows the aggregation of all positions in an input sequence, weighing each position according to its importance. This approach significantly enhances a model's ability to capture long-range dependencies within a sequence. The Transformer architecture uses this technique to achieve better performance than traditional sequence models, such as recurrent neural networks (RNNs) [19] and long short-term memory (LSTM) [20] networks, when dealing with long sequences. Moreover, the Transformer architecture does not rely on sequential computations over time, as opposed to RNNs and LSTMs. This feature allows efficient parallel computation during training, resulting in faster processing times. This advantage over traditional sequence models can significantly reduce the time required for training.

Figure 2.1: Example of correlation between query vector and key vector.
There are three essential parameters for the attention mechanism: the Query, Key and Value matrices (Q, K, V), as depicted in the following equation,

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V. \tag{2.1}
\]

Notably, the Query, Key, and Value matrices are derived from the word embedding vectors through linear transformations, as illustrated by

\[
\mathrm{Embedding} \times W^{Q} = Q, \qquad
\mathrm{Embedding} \times W^{K} = K, \qquad
\mathrm{Embedding} \times W^{V} = V, \tag{2.2}
\]

where $W^{Q}$, $W^{K}$, $W^{V}$ are trainable parameters. The Q matrix multiplied by the transpose of the K matrix produces the correlation matrix between the Query and Key. A higher correlation between Query and Key results in a larger corresponding value. This procedure is important because it allows the strength of the relationship between Query and Key to be determined in a structured and quantitative manner, allowing the original word embedding to be enhanced.

The correlation computation is shown in Figure 2.1; to simplify the presentation, the original embeddings are used directly. This means that the Query matrix on the left and the Key matrix on the top have the same values, whereas both matrices undergo a linear transformation during the actual computation. The yellow correlation matrix reflects the correlation between each pair of tokens, with higher values indicating a stronger correlation with the current token, and therefore requiring a higher weight. These weights are then embedded as contextual information from the corpus into the original vector. Normalization by the dimension is necessary to obtain reasonable values for the relationships and to maintain stability in the gradients, which is crucial for neural networks. Subsequently, the correlation matrix is multiplied by the Value matrix to obtain the attention-weighted word embedding representation.

The attention mechanism aims to improve the learning process by identifying relevant tokens and strengthening their weights. This is accomplished by comparing associations between sequence contexts and amplifying weights, resulting in an enhanced representation of the original word embeddings in the current context.

2.1.2 Multi-headed Attention

The multi-head attention mechanism utilizes multiple sets of trainable matrices, namely $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$, to project the input data into different subspaces. By initializing these matrices randomly, each head captures distinct semantic information, leading to diverse feature representations. Subsequently, the output features from the different heads are concatenated and transformed by $W^{O}$, which is the linear layer, to obtain the final embedding representation, as seen here:

\[
\begin{aligned}
\mathrm{MultiHead}(Q, K, V) &= \mathrm{Concat}(\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_h)\, W^{O} \\
\mathrm{head}_i &= \mathrm{Attention}(Q W_i^{Q},\, K W_i^{K},\, V W_i^{V}).
\end{aligned} \tag{2.3}
\]

This approach facilitates the learning of distinct pieces of information in various representation spaces, which enhances the flexibility and efficiency of the model. The design is reminiscent of convolutional neural networks, which employ a large number of convolution kernels to capture features in feature space by randomly initializing the values of the different kernels. Increasing the number of heads in the Transformer model can enhance its ability to capture sequence features. Figure 2.2 shows that the multi-head attention mechanism exhibits a high degree of parallelism, because the attention heads have no interconnections and can therefore be computed in parallel. This property provides the model with increased flexibility and perceptual capabilities, while making efficient use of computational resources without encountering training obstacles.
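To make Equations 2.1 and 2.3 concrete, the following PyTorch sketch implements scaled dot-product attention and a small multi-head module. It is an illustration only; the class and dimension names are our own, and the models benchmarked in this thesis use the Hugging Face and LightSeq implementations rather than this code.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Equation 2.1: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # correlation between queries and keys
    weights = F.softmax(scores, dim=-1)             # row-wise attention weights
    return weights @ V                              # attention-weighted value vectors

class MultiHeadAttention(torch.nn.Module):
    """Equation 2.3: h parallel heads followed by the output projection W^O."""
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_head = num_heads, d_model // num_heads
        # W^Q, W^K, W^V for all heads fused into single linear layers
        self.w_q = torch.nn.Linear(d_model, d_model)
        self.w_k = torch.nn.Linear(d_model, d_model)
        self.w_v = torch.nn.Linear(d_model, d_model)
        self.w_o = torch.nn.Linear(d_model, d_model)

    def forward(self, x):                           # x: (batch, seq_len, d_model)
        b, n, _ = x.shape
        split = lambda t: t.view(b, n, self.h, self.d_head).transpose(1, 2)
        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        out = scaled_dot_product_attention(q, k, v) # all heads computed in parallel
        out = out.transpose(1, 2).reshape(b, n, self.h * self.d_head)
        return self.w_o(out)                        # concatenation followed by W^O

x = torch.randn(2, 16, 768)                         # (batch, tokens, embedding dim)
print(MultiHeadAttention(768, 12)(x).shape)         # torch.Size([2, 16, 768])
```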
Figure 2.2: Multi Attention with three attention heads; vector a1 is the word embedding and b1 head_i is the attention vector generated by the different attention heads; these vectors are transformed by the linear layer.

2.1.3 Transformer Architecture

The Transformer architecture uses a conventional Encoder-Decoder structure. In particular, the architecture consists of a stack of N encoders on the left-hand side of Figure 2.3, where the input elements are uncorrelated, allowing parallelization. On the right-hand side, a stack of N decoders is shown, where the inputs rely on the previous output prediction token. This creates a temporal dependence between inputs and prohibits simultaneous output of the prediction results.

The Transformer architecture discards the recurrent loop structure, which makes it impossible to directly capture position information in a sequence. To overcome this limitation, the Transformer uses positional encoding to append positional information to each input vector, allowing the model to better grasp the sequential information in the sequence. The basic architecture is shown in Figure 2.3.

Figure 2.3: Transformer Architecture; on the left is the stack of encoders, on the right is the stack of decoders.

Because of the absence of a loop structure, identical tokens within the input corpus would be assigned identical embeddings. To address this limitation and ensure that the relevant positional information of the tokens in the sequence is captured, Vaswani et al. introduced positional encoding after the embedding stage.

Positional encoding is a technique used to incorporate positional information into the sequence input embeddings. This technique involves adding an offset vector to the input embedding vector. The offset vector is generated by a specific function that produces a unique position encoding at different locations and time steps. The most commonly used functions for generating the position encoding are the sine and cosine functions. These functions are effective in capturing relative position information between different locations in a sequence and representing it as a continuous vector. Positional encoding with sine and cosine is done using the following equations

\[
PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right). \tag{2.4}
\]

The Transformer uses the sine function for the even dimensions of the encoding and the cosine function for the odd dimensions.
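A short sketch of Equation 2.4, assuming a PyTorch environment; the function name and shapes are our own and only illustrate how the offset vector is built and added to the token embeddings.

```python
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    """Equation 2.4: sine on even dimensions, cosine on odd dimensions."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)        # even dimension indices 2i
    angle = pos / torch.pow(10000.0, two_i / d_model)               # pos / 10000^(2i/d)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe

# The encoding is simply added to the token embeddings before the first encoder layer.
embeddings = torch.randn(16, 768)                  # (tokens, d_model), hypothetical input
embeddings = embeddings + sinusoidal_positional_encoding(16, 768)
```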
2.1.4 BERT

BERT stacks the Transformer encoder, with Google offering two variants: BERTbase, featuring a 12-layer encoder, and BERTlarge, containing a 24-layer encoder. These models also differ in the number of attention heads and in the hidden dimension: BERTbase has 12 attention heads and a hidden size of 768, while BERTlarge has 16 attention heads and a hidden size of 1024.

Figure 2.4 shows the architecture of the BERTbase model used for sequence classification. It consists of three layers: a token embedding layer, a Transformer encoder layer, and a task-specific output layer. The token embedding layer is responsible for creating embeddings that contain both sentence and position information for the existing tokens. The Transformer encoder layer, which is built by stacking multiple encoders, extracts semantic information from the input and passes it to the task-specific output layer. The task-specific output layer generates the final output based on the requirements of the downstream task. By selecting different outputs, various downstream tasks can be performed. For a sequence classification task, a dense layer is randomly initialized as a classification head at the output of the encoder.

Figure 2.4: BERTbase for sequence classification.

2.1.5 Pre-train and Fine-tuning

The pre-training procedure for BERT involves two primary components. The first component is called Masked Language Modeling (MLM), which involves masking a segment of the words in the training corpus and training the model to predict the missing words based on the context of the surrounding words. This approach enhances the model's ability to understand the semantics of the masked words by leveraging the context of the words to the left and right of the masked word. This results in a bidirectional understanding of the language model, which allows it to comprehend language more effectively.

The second component of the BERT pre-training procedure is known as the Next Sentence Prediction (NSP) task. The NSP task involves training the model to predict whether sentence B follows sentence A in a given text, with a 50% probability that B is the next sentence. By training the model on this task, BERT is better able to understand the relationship between sentences in a document, which is essential for many natural language processing tasks, such as question answering and sentiment analysis. The combination of MLM and NSP enables BERT to learn rich representations of language that can be fine-tuned for a wide range of downstream tasks.

The term "fine-tuning" refers to the process of optimizing a pre-trained model on a specific dataset and, usually, a downstream task. This technique differs from pre-training, which involves training a model on a large dataset to learn general features that can be applied to various tasks. During this phase, the model architecture undergoes rapid changes. Once the model architecture is fixed, we can accelerate the training process by using a custom kernel. Therefore, determining the model architecture is critical to determining the optimization tools that can be applied.

The primary goal of fine-tuning is to retain the knowledge acquired by the pre-trained model while adapting it to the target dataset and downstream task. To achieve this, several techniques are used to prevent the model from forgetting previously learned features. For example, when fine-tuning on a computer vision task, e.g. the VGG network [21], it may be beneficial to freeze its feature extraction layers and train only the dense layer for classification on a new dataset. This approach reduces computational costs, but it also has the disadvantage of limiting the model's ability to achieve high accuracy.

The learning rate is a critical parameter in the model fine-tuning process, and common techniques include lowering the learning rate, adjusting the optimizer's β, using warm-up, and learning rate decay [22]. These methods aim to strike a balance between model training stability and learning speed during the fine-tuning process. In this thesis, the words "training" and "fine-tuning" are used interchangeably; whenever a section or paragraph refers to pre-training, "pre-training" is stated explicitly.
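As an illustration of fine-tuning for sequence classification, the sketch below loads a pre-trained BERT with a randomly initialized classification head through the Hugging Face transformers API and optionally freezes the encoder. The model name, hyperparameters and toy batch are placeholders, not the exact training setup used later in the thesis.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

# Pre-trained encoder plus a randomly initialized classification head (cf. Figure 2.4).
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Optional: freeze the pre-trained feature extractor and fine-tune only the head,
# trading some accuracy for a much cheaper fine-tuning run.
for param in model.bert.parameters():
    param.requires_grad = False

optimizer = torch.optim.AdamW(
    filter(lambda p: p.requires_grad, model.parameters()), lr=2e-5
)

batch = tokenizer(["a great movie", "a terrible movie"],
                  padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

outputs = model(**batch, labels=labels)   # forward pass returns the loss directly
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```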
2.2 Deep Learning Optimizer

A deep learning optimizer is a set of algorithms used to adjust the weights and biases of a deep neural network throughout the training phase, with the aim of reducing the error between the predicted and true values. Essentially, the main task of the optimizer is to determine the set of parameters that will produce the most accurate model predictions on the training data and to update the model parameters according to a selected algorithm.

The optimization process involves iteratively modifying the model parameters using the gradient of the loss function with respect to those parameters. The optimizer determines the direction and magnitude of each parameter update, while the learning rate controls the size of the update. Various optimization algorithms use different techniques to compute parameter updates, which may affect the speed and accuracy of model training.

2.2.1 Stochastic Gradient Descent

Stochastic gradient descent (SGD) [23] is a commonly used and straightforward method for optimizing model parameters in machine learning. Its popularity is mainly due to its intuitiveness and ease of implementation, making it a popular choice among researchers and engineers. Denoting the model parameter set at iteration $t$ by $\theta_t$, the fundamental concept behind stochastic gradient descent is illustrated by the equation

\[
\theta_{t+1} = \theta_t - \eta_t \nabla L(\theta_t), \tag{2.5}
\]

where $\eta_t$ is the learning rate and $\nabla L(\theta_t)$ denotes the gradient of the loss function with respect to the parameter vector $\theta_t$, evaluated for a sample randomly chosen from the training set.

As machine learning progressed, researchers discovered that the choice of learning rate can significantly affect the convergence and dispersion of the model. If the learning rate is set too low, the model is guaranteed to converge, but the learning speed is impractically slow. Conversely, if the learning rate is set too high, the model can learn features from the training data more quickly, but it may struggle to converge to a local optimum. To solve this problem, researchers proposed dynamic learning rate algorithms [24], which are an extension of the stochastic gradient descent algorithm. This approach adjusts the learning rate as a function of the number of iterations, resulting in a schedule of learning rates. During the first iterations, the algorithm induces large changes in the parameters of the model, while subsequent iterations incur smaller and smaller changes.

2.2.2 Momentum

Momentum is an optimization concept in machine learning that helps to accelerate gradient descent in the correct direction, thus speeding up the convergence rate. It is an extension of the gradient descent optimization algorithm that builds historical memory information into the search direction to overcome local minima and the oscillations caused by noisy gradients. Momentum is based on the same concept as momentum in physics, where a ball rolling down a hill builds enough momentum to jump out of a local minimum and reach the global minimum.

The main idea behind momentum is to calculate an exponentially weighted average of the historical gradients and use it to update the weights. By considering the past gradients, the gradient descent step becomes smooth, which reduces the amount of oscillation seen across the iterations.

\[
\begin{aligned}
g_t &\leftarrow \nabla L(\theta_{t-1}) \\
v_t &\leftarrow \gamma v_{t-1} + \eta_t g_t \\
\theta_t &\leftarrow \theta_{t-1} - v_t
\end{aligned} \tag{2.6}
\]

Equation 2.6 shows how the momentum is involved in the gradient update. The term $g_t$ is the gradient computed by comparing the prediction from the current parameters with the actual result, i.e. the descent direction that the sample suggests. The term $v_t$ represents the momentum accumulation: the current gradient is weighted and added to the gradients of the previous steps, so that historical gradient information is introduced to help the system maintain the correct descent direction. Here $\gamma$ represents the weight of the momentum from the previous steps and $\eta_t$ represents the learning rate applied to the current gradient. Finally, the parameters are updated using the $v_t$ term to obtain the new parameters, as sketched in the example below.
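A minimal sketch of the update rules in Equations 2.5 and 2.6 on a toy quadratic loss, assuming PyTorch tensors. The function and hyperparameter values are illustrative; in practice torch.optim.SGD with a momentum argument implements the same idea.

```python
import torch

def sgd_momentum_step(params, grads, velocities, lr=0.01, gamma=0.9):
    """One parameter update following Equations 2.5 and 2.6."""
    for p, g, v in zip(params, grads, velocities):
        v.mul_(gamma).add_(g, alpha=lr)   # v_t = gamma * v_{t-1} + eta * g_t
        p.sub_(v)                         # theta_t = theta_{t-1} - v_t

# Toy objective L(theta) = ||theta||^2 / 2, whose gradient is simply theta.
theta = torch.tensor([1.0, -2.0])
velocity = torch.zeros_like(theta)
for step in range(100):
    grad = theta.clone()                  # gradient of the toy loss
    sgd_momentum_step([theta], [grad], [velocity])
print(theta)                              # close to the optimum at zero
```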
2.2.3 Adam

The Adam optimizer combines the ideas of momentum and adaptive learning rates, which allows it to accelerate the training process of deep learning models and avoid some problems of the SGD algorithm, such as the learning rate being difficult to adjust and the tendency to fall into local optimal solutions.

The concept of an adaptive learning rate originated from the AdaGrad optimization algorithm, introduced by John Duchi et al. in 2011 to address the limitations of the stochastic gradient descent (SGD) and momentum algorithms. AdaGrad enables the adjustment of a different learning rate for each parameter: frequently changing parameters are updated with smaller steps, while sparse parameters are updated with larger steps,

\[
g_t = \nabla L(\theta_t), \qquad
\theta_t = \theta_{t-1} - \alpha \cdot \frac{g_t}{\sqrt{\sum_{i=1}^{t} g_i^2}}. \tag{2.7}
\]

The adaptive learning rate in AdaGrad is determined by the sum of the squares of the past gradients up to the current time step $t$. If many parameters are being updated in $g_t$ (non-sparse), the sum of the squares will be larger, leading to a decrease in the learning rate for the current $\theta_t$. Conversely, if fewer parameters are being updated in $g_t$ (sparse), the sum of the squares will be smaller, resulting in an acceleration of the update of $\theta_t$. This relationship is expressed in Equation 2.7.

\[
\begin{aligned}
g_t &= \nabla L(\theta_t) \\
m_t &= \beta_1 \cdot m_{t-1} + (1-\beta_1) \cdot g_t \\
v_t &= \beta_2 \cdot v_{t-1} + (1-\beta_2) \cdot g_t^2 \\
\hat{m}_t &= m_t / (1-\beta_1^t) \\
\hat{v}_t &= v_t / (1-\beta_2^t) \\
\theta_t &= \theta_{t-1} - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t
\end{aligned} \tag{2.8}
\]

The Adam algorithm workflow is presented in Equation 2.8. The algorithm incorporates momentum and adaptive learning rates to accelerate the machine learning model's training process and avoid local optima. The momentum term, $m_t$, controls the direction of the gradient descent, preventing the model from falling into a local optimum. The $v_t$ term adjusts the learning rate, $\eta$, based on the gradient's sparsity. By combining adaptive learning rates and momentum, Adam can optimize the model's parameters efficiently.

The weights of the first-order and second-order moment estimates are controlled by $\beta_1$ and $\beta_2$, respectively. $\beta_1$ is the exponential decay rate of the first-order moment estimate, and its value is usually set to 0.9 to balance the weight of the historical gradients. Similarly, $\beta_2$ is the exponential decay rate of the second-order moment estimate and is usually set to 0.999 to adjust the weight of the squared historical gradients.
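The following sketch implements one Adam step exactly as written in Equation 2.8, applied element-wise to a parameter tensor; it is for illustration only, and in practice torch.optim.Adam or AdamW would be used.

```python
import torch

def adam_step(theta, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update following Equation 2.8 on a single parameter tensor."""
    m.mul_(beta1).add_(grad, alpha=1 - beta1)            # first-moment estimate m_t
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)  # second-moment estimate v_t
    m_hat = m / (1 - beta1 ** t)                         # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta.addcdiv_(m_hat, v_hat.sqrt().add_(eps), value=-lr)
    return theta

theta = torch.randn(10)
m, v = torch.zeros_like(theta), torch.zeros_like(theta)
for t in range(1, 201):                                  # minimize ||theta||^2 / 2
    adam_step(theta, theta.clone(), m, v, t)
print(theta.norm())                                      # close to the optimum at zero
```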
2.2.4 Adan

Xie et al. [25] recently proposed Adan, an optimization algorithm that employs Nesterov momentum [26]. By looking ahead with newer weights, Adan is able to anticipate the direction of the gradient descent and determine the final gradient update direction, which can potentially accelerate convergence compared to using momentum alone. However, a potential limitation of Nesterov momentum is the need to incorporate the yet-to-be-updated momentum into the network and propagate it to the loss value, resulting in multiple recomputations. As a consequence, the full potential of Nesterov momentum has not been realized, and further development of the method is required.

\[
\begin{aligned}
m_t &= \beta m_{t-1} + \nabla L(\theta_{t-1} - \alpha \beta m_{t-1}) \\
\theta_t &= \theta_{t-1} - \alpha m_t
\end{aligned} \tag{2.9}
\]

Equation 2.9 requires the gradient $\nabla L$ to be computed at the look-ahead point before the loss value is obtained. However, it is possible to modify the equation so that the stored historical gradients replace this extra calculation without altering the result:

\[
\begin{aligned}
m_t &= \beta m_{t-1} + \nabla L(\theta_{t-1}) + \beta\left[\nabla L(\theta_{t-1}) - \nabla L(\theta_{t-2})\right] \\
\theta_t &= \theta_{t-1} - \alpha m_t
\end{aligned} \tag{2.10}
\]

Equation 2.9 can thus be rewritten as Equation 2.10. This modification is equivalent and does not affect the convergence of the training. Furthermore, it becomes apparent that $\nabla L(\theta_{t-1}) - \nabla L(\theta_{t-2})$ approximates the second-order derivative of the objective function. By leveraging this information, the direction of the gradient change can be determined, thus achieving faster convergence compared to stochastic gradient descent.

\[
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1) g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)(g_t - g_{t-1}) \\
n_t &= \beta_3 n_{t-1} + (1-\beta_3)\left[g_t + \beta_2 (g_t - g_{t-1})\right]^2 \\
\eta_t &= \eta / \left(\sqrt{n_t} + \epsilon\right) \\
\theta_{t+1} &= (1 + \lambda_t \eta)^{-1}\left[\theta_t - \eta_t \circ (m_t + \beta_2 v_t)\right]
\end{aligned} \tag{2.11}
\]

The basic structure of Adan is presented in Equation 2.11. Notably, the authors have implemented first-order and second-order moment estimates similar to those employed in the Adam optimizer. To realize the Nesterov momentum, the previous gradient $g_{t-1}$ must be kept so that the difference $g_t - g_{t-1}$ can be computed once $g_t$ is obtained.
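The sketch below mirrors the update rule in Equation 2.11 for a single parameter tensor. It is only an illustration of the arithmetic: the hyperparameter values are placeholders, and the official Adan implementation (as well as the fused kernel developed in this thesis) differs in details such as hyperparameter conventions and multi-tensor handling.

```python
import torch

def adan_step(theta, grad, prev_grad, m, v, n,
              lr=0.1, beta1=0.98, beta2=0.92, beta3=0.99, eps=1e-8, weight_decay=0.0):
    """One Adan update following Equation 2.11 on a single parameter tensor."""
    diff = grad - prev_grad                           # g_t - g_{t-1}
    m.mul_(beta1).add_(grad, alpha=1 - beta1)         # m_t
    v.mul_(beta2).add_(diff, alpha=1 - beta2)         # v_t
    n.mul_(beta3).add_((grad + beta2 * diff) ** 2, alpha=1 - beta3)  # n_t
    eta_t = lr / (n.sqrt() + eps)                     # per-element learning rate
    theta.sub_(eta_t * (m + beta2 * v))
    theta.div_(1 + weight_decay * lr)                 # decoupled weight decay term
    return theta

theta = torch.randn(10)
m, v, n = (torch.zeros_like(theta) for _ in range(3))
prev_grad = torch.zeros_like(theta)
for t in range(1, 201):                               # minimize ||theta||^2 / 2
    grad = theta.clone()
    adan_step(theta, grad, prev_grad, m, v, n)
    prev_grad = grad
print(theta.norm())                                   # close to the optimum at zero
```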
To handle branches, modern GPUs re-insert the instructions of each branch into the instruction stream and issue them multiple times until all results have been calculated, using a mask to prevent some threads from writing results to registers, thereby completing the branching operation. As a result, an excessive number of branches in a kernel degrades the performance of the GPU. In contrast, a CPU has a more robust control unit that enhances its ability to handle instruction branches.

The high degree of parallelism and the hierarchical memory architecture of the GPU present possibilities for optimizing existing machine learning software. The pipelined structure of the GPU compute cores allows commonly used pipeline optimization techniques, such as loop unrolling and instruction rescheduling, to improve pipeline performance. Additionally, special vector instructions can be employed to enhance data-reading capability. The hierarchical cache structure enables caching of frequently used data, reducing unnecessary memory accesses and improving performance.

Despite advancements in computer systems, the CPU remains at the center and is referred to as the host, while GPUs and other accelerators are regarded as devices, functioning as subordinate peripherals. To use a GPU, engineers must manually specify the programs to run on it, which results in additional overhead. Therefore, reducing this overhead is a crucial topic for optimizing modern GPU programs.

2.3.2 TPU

The Tensor Processing Unit (TPU) [27] is a specialized accelerator for tensor operations, introduced by Google in 2017. Its primary function is to facilitate the efficient execution of the matrix operations that are pervasive in deep learning. The TPU is a custom-designed chip that consists of several dedicated components, such as a matrix multiplication unit, an accumulator, a nonlinear activation unit, and an on-chip buffer.

The Matrix-multiply unit (MXU) is the foundation of the TPU, and it presents a different architecture compared to a GPU. While a GPU is often described as a temporal architecture, the TPU is characterized as a spatial architecture; both employ similar computing units and use multiple units as computing components. The TPU's spatial design enables it to support a variety of datapaths, facilitating data exchange across multiple Processing Elements (PEs).

The systolic array is a classical spatial architecture that was first introduced by Kung et al. [28]. This architecture is characterized by its simplicity and low cost. The primary design goal of the systolic array is to reduce the overhead of data transfer between memories by extending the time data spends flowing through the computing units. The systolic array achieves this by reusing data as many times as possible in the processing pipeline, fetching data once, and cascading multiple PEs so that the results of the previous level of operations serve as input to the current PE.

As shown in Figure 2.6, the first data element enters the first PE and is processed, then passed to the next PE. Meanwhile, the second data element enters the first PE, and so on. By the time the first data element reaches the last PE, it has been processed multiple times. This pulsating architecture reuses input data multiple times, resulting in high computing throughput while consuming less memory bandwidth.
Figure 2.6: Principle of the systolic array

Upon expansion of the one-dimensional array shown in Figure 2.6 into a two-dimensional array, the resulting structure is recognized as the matrix multiplication unit (Figure 2.7) present in the TPU.

Figure 2.7: MXU architecture

In contrast to GPUs, TPUs utilize a systolic array, which is a data-flow-dependent computational architecture. This structure takes advantage of the potential parallelism in computationally intensive tasks, allowing for high throughput even at lower clock frequencies. Using the dynamic power formula $P = C \cdot f \cdot V^2$, it can be seen that operating at a lower circuit frequency at a given voltage leads to lower power consumption. Google's research [29] shows that a computing cluster of 16 TPUv2 chips can consume 36% less energy compared to a cluster of 16 V100 32GB GPUs.

The TPUv2 has a power consumption of 280W per chip and contains two MXU cores on each chip. The instance we have access to contains four TPUv2 chips, resulting in an available system power consumption of about 1120W. In comparison, the average power consumption of the NVIDIA V100 PCIe 16GB is about 250W per chip.

In summary, the goal is to compare TPUs with GPUs in terms of training time and cost of use, and furthermore to assess the level of complexity involved in transitioning from GPU training to TPU training, including aspects such as software reliability and usability. This comparison will give users more insight into choosing the proper configuration to accelerate their machine learning tasks.

2.4 Mixed Precision Training

Mixed precision training [12] is a popular technique that exploits the low precision of FP16 to reduce computation and memory overhead while maintaining model accuracy. The key idea is to store the network parameters in the low-precision FP16 format while maintaining master copies in the high-precision FP32 format to preserve accuracy. During training, the range of parameter values is usually kept within a limited range, which makes the use of FP16 parameters a viable option with little impact on accuracy. However, the gradient has the potential to suffer from numerical underflow, especially in the later stages of training, resulting in a loss of accuracy. The basic weight update process is illustrated in Figure 2.8.

Figure 2.8: Mixed precision training weight update

To alleviate this problem, a technique called gradient scaling is introduced: the loss is multiplied by a loss scale so that the resulting FP16 gradients are mapped into a range of values representable in FP16, avoiding underflow during backpropagation. At the end of backpropagation, the gradient is divided by the loss scale so that its true value is restored.

Figure 2.9: BF16, FP16 and FP32 formats, showing the different bit allocations of the three floating-point representations.

If the gradient is found to have underflowed during the update process, the update is skipped. In the parameter update step, all updates are performed on the FP32 copies of the model parameters to ensure numerical stability. This approach results in faster training and lower memory consumption due to the use of the less accurate FP16 parameters, while the use of the more accurate FP32 copies ensures accurate parameter updates.

FP16 and BF16

Mixed-precision training gains a significant advantage from low-precision floating-point representations such as FP16 and BF16.
FP32 and FP16 are commonly used in GPU training, while BF16 is preferred for Google's TPU training. Both FP16 and BF16 use 16 bits of data but differ in the lengths of their mantissa and exponent, as illustrated in Figure 2.9. Note that BF16 has the same number of exponent bits as FP32 and therefore the same representable range as FP32.

Converting generic FP32 data to BF16 is simple, as the exponent of BF16 is the same as that of FP32 and requires no special processing. However, converting from FP32 to FP16 requires special handling due to the difference in their representation capabilities. Typically, FP32 values need to be rescaled to adapt to the range of FP16.

2.5 LightSeq Library

LightSeq, originally created for high-performance inference with Transformers, has been extended to also accelerate the training and fine-tuning of Transformer-based models on GPUs [30], [31]. As previously stated, Transformer-based models are widely used due to their stellar performance in many ML tasks, with one of the most common approaches being fine-tuning large pre-trained models such as BERT, GPT, etc.

LightSeq comes with a series of GPU optimization techniques specifically tailored to the computation and memory access flows of Transformer-based models in order to improve performance. LightSeq focuses on the following optimization techniques: Computational Graph Optimization, Dependent Reduction Rewriting, Accelerated Mixed-Precision Update for Trainer, and Dangling-Tensor Aware Memory Manager.

Computational Graph Optimization - LightSeq aims to replace as many fine-grained GPU kernel operations implemented in PyTorch as possible with coarse-grained fused ones, which reduces the overhead caused by launching kernel functions. General matrix multiply (GEMM) operations such as linear transformations and scaled dot products use the implementations found in cuBLAS, whilst non-GEMM operations such as Dropout, LayerNorm and Softmax are re-implemented in LightSeq. Element-wise operations such as Dropout, ReLU, etc. are fused into coarse-grained kernels if they are adjacent in the computational graph of the Transformer. These operations also enable further optimization in the shape of parallelism. Layers such as the embedding layer and the criterion layer are also optimized with more parallelism, and Softmax is reworked so that the logarithmic Softmax is calculated instead of the Softmax directly.

Accelerated Mixed-Precision Update for Trainer - Parameters and gradients are FP16 during forward- and back-propagation when using mixed precision; however, the optimizer requires FP32 copies to ensure accuracy. Traditionally, each gradient or parameter is copied and stored as an FP32 copy, and then loaded when needed by the optimizer. This approach introduces two inefficiencies: mass kernel launches when copying and updating, which reduce GPU utilization, and redundant memory footprints (storing both FP16 and FP32 gradients/parameters), which reduce memory efficiency. LightSeq resolves these two inefficiencies by using symbolic tensor linking and on-the-fly conversion. When initializing the optimizer, parameters and gradients are copied into a separate workspace tensor, which is then linked to the optimizer. LightSeq executes the optimizer update only once per training step to update the workspace, instead of once per updated parameter/gradient, which reduces kernel launches.
When loading from the workspace, the parameters/gradients are loaded as FP16 and then converted on-the-fly to FP32 in registers to be updated. Once updated, they are re-converted to FP16 before being stored in the workspace. This reduces data movement by 50% and avoids redundant FP32 copies, since all memory accesses are done using FP16.

Dependent Reduction Rewriting - Batch reduction operations such as LayerNorm and Softmax traditionally take a lot of time during training and are therefore a focus of optimization in LightSeq. Changes to how the batch reduction operations of LayerNorm are performed, together with a rearrangement of the gradient calculation formula, allow the batch reduction operations to be executed in parallel. LightSeq stores parameters as FP16 floating-point numbers but converts them to FP32 during computation, because LayerNorm is sensitive to the low precision of floating-point numbers.

Dangling-Tensor Aware Memory Manager - In general, large-batch training leads to faster convergence and higher accuracy; however, due to memory limitations it requires either multiple GPUs, memory offloading or gradient accumulation. LightSeq divides GPU memory into permanent and temporary memory, used to store parameters and gradients as well as intermediate tensors, respectively. During the initialization phase, LightSeq scans the training set and estimates the capacity required; the temporary memory is then allocated once, reused by different batches, and finally released at the end of training. This reduces the memory footprint by compacting memory and reducing the number of allocations and releases.

2.6 Distributed Training

Distributed training is a technique used to address the problem of handling the large volumes of data needed to train machine learning models, especially neural networks. Given the limited computing power of individual machines, it has become necessary to distribute machine learning workloads across multiple machines, transforming a centralized system into a distributed one. Two primary parallel approaches, data parallelism and model parallelism, are used to facilitate distributed training. These methods can be used independently or in combination. In addition, when working with distributed training, it is essential to consider the fault tolerance and scalability of the system.

2.6.1 Topology and Communication

The Parameter Server [32] and Ring All Reduce [33] algorithms are widely used in industry, for example at Google and Microsoft. Parameter Server involves a master server and multiple worker nodes and is therefore an asymmetric system, where worker nodes obtain all parameters of the model from the master node through inter-node communication. Ring All Reduce utilizes a symmetric point-to-point system where all nodes are equal and form a ring structure. The transmission of node parameters in this system involves each node transmitting its parameters to the node on its left and receiving the remaining parameters from the node on its right. This process effectively reduces network communication, allowing all nodes to obtain updated values of the model parameters during the transmission process.

There are two modes of inter-node communication when obtaining model parameters: synchronous and asynchronous. In a symmetric system, where each node has comparable computational speed, synchronous communication is preferred to maintain the model's accuracy. However, in systems with nodes of varying performance, synchronous communication can limit system throughput.
In such scenarios, asynchronous communication can improve system throughput by reducing the waiting time of faster nodes. However, asynchronous communication introduces random noise (staleness), which represents an out-of-date direction of gradient descent. This noise can act similarly to the momentum used in the optimizer [34], facilitating the parameter update process and preventing the model from falling into a local optimum.

Figure 2.10: Distributed training topology; on the left is the parameter server, on the right is the ring organization.

Figure 2.11: Data Parallel

2.6.2 Data Parallel

The data parallel approach involves replicating the model parameters on each node in a distributed system. To facilitate training, the training data is divided into N parts, where N is equal to the number of nodes in the system. During the training process, each node updates the gradient information based on its current batch of data. An update mechanism, such as Parameter Server or Ring All Reduce, is used to ensure that these updates are propagated throughout the system and applied to all copies of the model parameters.

As shown in Figure 2.11, the training data is divided into four parts and the entire model is stored on each device; each device trains on a quarter of the dataset, and updated parameters are transferred between the devices to ensure that the model parameters descend in the correct direction.

Data parallelism is a popular technique in distributed training, especially when the model size is within the limits of a single GPU's capacity. However, when the model is too large to fit into a single GPU, alternative methods such as model parallelism or pipeline parallelism should be considered. These alternatives come with added communication and scheduling overhead.

2.6.3 Workflow

The data parallel execution model currently in use operates as follows. The distributed initiator broadcasts the training data from the node holding the data to the entire communication network via Broadcast. Subsequently, each GPU starts the forward pass, which does not require communication between GPUs because there are no data dependencies.

However, during the backward pass, the gradient information of each layer needs to be exchanged with the other devices so that the entire communication network obtains the accumulated gradient information of the whole mini-batch. This step is known as a reduce sum within collective communication. In practice, communication is initiated as soon as a layer in the network completes backpropagation. This provides the advantage of overlapping communication with computation and can mask the communication overhead.

In this parallel process of computation and communication, the slower communication process often determines the time overhead. A layer may finish its computation but have to wait for the communication channel to become available, resulting in a cumulative delay in the backpropagation process. Once all devices have obtained the complete gradient information, averaging the gradient within the batch gives the optimizer the gradient it requires, and executing optimizer.step() completes the update process. Therefore, the bottlenecks that affect the training speed throughout a training step are the communication time between GPUs and the computation time of the backpropagation process.
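To illustrate this workflow, the sketch below shows how such a data parallel loop is commonly expressed in PyTorch. It is a minimal example under stated assumptions, not the exact training code used in this thesis: it assumes a process group has already been initialized with torch.distributed.init_process_group, that model, optimizer and train_loader exist, and that local_rank identifies the GPU of the current process. Wrapping the model in DistributedDataParallel is what launches the per-layer gradient all-reduce during the backward pass, overlapping communication with the remaining computation.

from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes torch.distributed.init_process_group("nccl") has already been called,
# and that model, optimizer, train_loader and local_rank are defined elsewhere.
model = model.to(local_rank)
ddp_model = DDP(model, device_ids=[local_rank])

for batch in train_loader:
    batch = {k: v.to(local_rank) for k, v in batch.items()}
    optimizer.zero_grad()
    loss = ddp_model(**batch).loss
    loss.backward()   # gradient all-reduce is launched per layer/bucket during backward
    optimizer.step()  # every replica applies the same averaged gradient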
2.6.4 Collective Communication

Collective communication refers to a global communication process involving all processes in a particular process group. It consists of fundamental operations such as send, receive, copy, barrier synchronization, and the synchronization of processes between nodes (signal + wait). These fundamental operations are combined to form communication templates, also known as communication primitives. The challenge of collective communication lies in ensuring communication efficiency and making optimal use of the network hardware connection topology.

In the forward propagation of distributed training, Broadcast is mainly used to distribute the training data to each node. Figure 2.12 shows the process of Broadcast, where the node owning the data communicates with every other node once and only once. Thus, the communication overhead of Broadcast grows linearly with N, the number of nodes in the communication world.

Figure 2.12: Broadcast communication

Allreduce is a collective communication operation that aggregates data from multiple processing units and combines this data into a global result using a designated operator. The resulting data is then distributed to all processing units. Allreduce is performed in several stages. First, each participant disperses its vectors. Second, each participant gathers vectors from the other participants. Finally, each participant performs the reduction operation on all accumulated vectors. By using different Allreduce sequences for different participants, highly complex computations can be distributed across many computational units. Allreduce is widely used by parallel applications in high-performance computing (HPC) related to scientific simulation and data analysis, such as machine learning computations, as well as the training phase of deep learning neural networks. Given the substantial growth of deep learning models and the complexity of scientific simulation tasks employing networks, an effective implementation of Allreduce is critical to reducing communication time.

Figure 2.13: Allreduce communication

Assume there are $P$ processes and a total of $N$ data elements. In the naive approach, where a main process handles the data, the other processes send their data to the main process, resulting in $(P-1)N$ data received. After the main process finishes handling the data, it broadcasts the result to the other processes, resulting in $(P-1)N$ data sent. Therefore, the total amount of communication is $2(P-1)N$, which means that the communication overhead increases linearly with the number of processes $P$.

In the Ring-AllReduce algorithm, the amount of communication per process can be calculated as follows. Each process must receive $\frac{N}{P}(P-1)$ data from the process on its left and send the same amount of data to the process on its right. Thus, the total communication per process is $\frac{2N}{P}(P-1)$. Chunking the data into pieces of size $\frac{N}{P}$ and distributing the chunks among the processes reduces communication bottlenecks. As the number of processes $P$ increases, the total communication converges to $2N$, indicating that the communication does not grow as the number of nodes in the communication world increases.

Therefore, the Ring-AllReduce algorithm is more efficient than the naive algorithm because it distributes computation and communication evenly among all participating processes, eliminating bottlenecks. Ring-AllReduce is commonly used in many AllReduce implementations, especially for distributed deep learning workloads.
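In practice, these primitives are exposed directly by communication libraries such as torch.distributed (backed by NCCL on NVIDIA GPUs). The sketch below is a minimal, illustrative example under the assumption that a process group has already been initialized and that each process owns one GPU; it broadcasts a batch from rank 0 and then sums a stand-in gradient tensor across all ranks, which is the reduce-sum pattern described above. The tensor shapes are arbitrary placeholders.

import torch
import torch.distributed as dist

# Assumes dist.init_process_group("nccl") has already been called by a launcher.
rank = dist.get_rank()
device = torch.device("cuda", rank)

batch = torch.empty(32, 128, device=device)
if rank == 0:
    batch.normal_()                               # rank 0 owns the training data
dist.broadcast(batch, src=0)                      # Broadcast: data-owning node sends to every other node

grad = torch.ones(1024, device=device) * rank     # stand-in for a local gradient
dist.all_reduce(grad, op=dist.ReduceOp.SUM)       # Allreduce: every rank obtains the global sum
grad /= dist.get_world_size()                     # average over the mini-batch replicas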
3 Methods

This chapter explores various strategies for optimizing performance when fine-tuning BERT models. Specifically, we examine the Adan optimizer and implement a hardware-optimized version using kernel fusion, multi-tensor access and instruction-level parallelism (ILP). In addition, this chapter investigates the effects on fine-tuning performance of techniques such as mixed precision training, distributed training, and LightSeq's acceleration library. Finally, all explored optimization techniques are used in a combined search over various hardware (NVIDIA A100, V100, A10, T4, and Google TPUv2, TPUv3) to find optimal solutions.

3.1 Fused Adan

The Adan optimizer is a novel optimizer that diverges from the widely used Adam optimizer. It leverages Nesterov momentum to accelerate the gradient descent algorithm, resulting in faster model training. As the algorithm is still in its early release, there is currently no optimized kernel available for the GPU. The implementation of fused Adan is introduced in the following subsections.

3.1.1 Profiling

The current Adan kernel is not equipped with a high-performance fused kernel, and its original implementation is based on the tensor compute interface provided by PyTorch. Unfortunately, this implementation may result in potential performance losses. Therefore, the original Adan implementation is profiled to investigate potential issues, see Figure 3.1.

By measuring the time consumed by optimizer.step() during one optimization step, it is possible to examine how much time is spent on computation tasks and on data preparation tasks. Profiling a single optimizer step resulted in a total time of 232.866 ms. This result can be broken down into two parts, computation and data preparation, which take 169.331 ms and 63.535 ms, respectively.

The profiling reveals that there are 4422 kernel launches during one optimization step, with 804 kernels dedicated to data preparation and 3618 kernels used for computation. Additionally, a kernel launch time graph for this original implementation is provided in Figure 3.1. This figure illustrates that the kernels are arranged loosely on the timeline, which results in a significant amount of GPU time being wasted and shows potential opportunities for optimization.

Figure 3.1: Unfused Adan kernel launch time graph with single tensor access.

According to the profiling results, the Adan optimizer takes a total computation time of 169,331 µs. However, only 31,293.63 µs is actually spent on actively running kernels, which implies that the computing units of the device are active only during this time. As a result, only 18.48% of the total time is spent on actual computation. This suggests that there is a considerable amount of overhead that could be eliminated through optimization.

3.1.2 Kernel fusion

Kernel fusion [35] is a method used to eliminate the extra overhead caused by multiple kernel launches and may also improve power efficiency on GPUs. By reusing data that has already been loaded into registers or shared memory, redundant loads and stores can be reduced. A kernel is essentially a function call, such as torch.mul() or torch.add(). To illustrate, consider the example code below: calculating the element-wise expression R = A*B + C*D with the PyTorch implementation requires dividing it into three tensor-operation kernel launches.
import torch

# Unfused version
# Create variables, matrix size (512, 512)
A = torch.rand(512, 512).to("cuda")
B = torch.rand(512, 512).to("cuda")
C = torch.rand(512, 512).to("cuda")
D = torch.rand(512, 512).to("cuda")

# Calculate R = A*B + C*D, with intermediate variables M, N
M = torch.mul(A, B)
N = torch.mul(C, D)
R = torch.add(M, N)

# Calculate R = A*B + C*D, fused version
fused_kernel(A, B, C, D, R)

/* Following code is written in CUDA style */
__global__ void fused_kernel(const float *A, const float *B,
                             const float *C, const float *D, float *R) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < 512 && col < 512) {
        int idx = row * 512 + col;
        // Both element-wise multiplications and the addition are
        // performed in a single kernel launch.
        R[idx] = A[idx] * B[idx] + C[idx] * D[idx];
    }
}

In the code above, torch.mul() is used twice and torch.add() once, with each function call incurring a GPU kernel launch, leading to increased overhead from resource allocation, instruction scheduling and data transfer. The fused kernel combines the two torch.mul() calls and the torch.add() call into a customized kernel, so that only one kernel launch is needed to perform the same operations as before.

3.1.3 Memory Hierarchy

In modern GPUs, multiple levels of storage are available for general-purpose computing, typically four levels: DRAM, scratchpad memory (shared memory), local memory, and registers. To access data, a thread must first locate it in DRAM and then copy it to its own storage, such as registers or local memory. Local memory is physically located in DRAM, but since its addresses can be determined at compile time, it can be cached. Registers are preferred for frequently accessed data because they have the fastest access speed and are closest to the compute units. However, registers are a limited resource due to hardware constraints. Scratchpad memory can be used as an alternative and has access speeds similar to registers.

The fused Adan kernel implementation avoids excessive access to device DRAM, which can lead to a memory wall, by using registers to store frequently accessed variables. This technique can significantly improve the arithmetic intensity at the DRAM level and reduce GPU idle time.

3.1.4 Vector Instruction

Vector instructions are a type of instruction that operates on multiple data elements in a single operation, derived from the Single Instruction Multiple Data (SIMD) model. In addition to their use in multimedia processing, such as image and audio/video processing, scientific simulation, and machine learning, vector instructions are widely used in high-performance computing due to their ability to reduce the number of instructions needed to operate on large data sets, resulting in increased throughput.

NVIDIA supports vector instructions in its Parallel Thread Execution (PTX) instruction set, such as vectorized load and store instructions. Using vectorized load instructions reduces the total number of instructions, reduces latency, and improves bandwidth utilization. However, the corresponding instruction cycles are longer than those of non-vectorized instructions, making this trade-off beneficial for medium-performance workloads.
/*0088*/ IMAD R10.CC, R3, R5, c[0x0][0x140]
/*0090*/ IMAD.HI.X R11, R3, R5, c[0x0][0x144]
/*0098*/ IMAD R8.CC, R3, R5, c[0x0][0x148]
/*00a0*/ LD.E.64 R6, [R10]
/*00a8*/ IMAD.HI.X R9, R3, R5, c[0x0][0x14c]
/*00c8*/ ST.E.64 [R8], R6

As an example, the code above utilizes the IMAD instruction to compute the address of the vector load. It subsequently uses LD.E.64 to load two 32-bit floating-point numbers into the R6 register simultaneously. The IMAD instruction is then employed again to calculate the data storage address, and the ST.E.64 instruction is used to store the data at the address indicated by the R8 register in a single operation.

Vector loading can be utilized through the vector data types defined in the CUDA C/C++ standard header, such as int2, int4, or float4. In the fused Adan single tensor access kernel, the float4 type is used for processing FP32 data. As a result, the compiler is able to emit vectorized load instructions, such as LD.E.128, to retrieve the target data in one go instead of employing four LD.E instructions for data access. The following code provides further insight into the fused Adan kernel.

float4* p4_ptr = reinterpret_cast<float4*>(p);
/* Variable pointers ... */
float4* exp_avg_diff4_ptr = reinterpret_cast<float4*>(exp_avg_diff);

/* Vectorized load from global memory */
float4 p4 = p4_ptr[global_id];
float4 g4 = g4_ptr[global_id];
const float4 neg_grad4 = neg_grad4_ptr[global_id];
float4 exp_avg4 = exp_avg4_ptr[global_id];
float4 exp_avg_sq4 = exp_avg_sq4_ptr[global_id];
float4 exp_avg_diff4 = exp_avg_diff4_ptr[global_id];

3.1.5 Instruction Level Parallelism

The concept of instruction-level parallelism plays a critical role in computer architecture, enabling a higher number of instructions to be executed per cycle by leveraging the microprocessor pipeline architecture and minimizing pipeline bubbles through instruction scheduling. The fused Adan multi-tensor access kernel extends the instruction scheduling space by utilizing loop unrolling techniques, allowing the compiler to reorder instructions based on pipeline characteristics during the compilation stage for optimal hardware performance.

Loop unrolling has several benefits, including reducing the overhead caused by branch prediction failures and minimizing the number of instructions required to maintain the loop counter. For example, in a loop such as for (int i = 0; i < 100; ), incrementing with i += 2 instead of i += 1 reduces the number of ADD instructions required to maintain the value of i from 100 to 50.

In addition, unrolling the loop allows two iterations to be processed at a time, enlarging the instruction scheduling space so that the pipeline can be fully utilized to hide access latency and improve throughput, especially when dealing with time-consuming memory accesses such as data loads and stores inside the loop body.

The provided code demonstrates the implementation of loop unrolling in fused Adan multi-tensor access. Specifically, by setting ILP to 4, each thread processes four elements per loop iteration. The compiler is instructed to perform loop unrolling using the #pragma unroll directive, which enhances both code readability and efficiency.

#define ILP 4
#pragma unroll
for (int ii = 0; ii < ILP; ii++) {
    int i = i_start + threadIdx.x + ii * blockDim.x;
    if (i < n && i < chunk_size) {
        r_p[ii] = p[i];  // r_ denotes a register variable
        r_g[ii] = g[i];
        r_exp_avg[ii] = exp_avg[i];
        r_exp_avg_sq[ii] = exp_avg_sq[i];
        r_exp_avg_diff[ii] = exp_avg_diff[i];
        r_neg_grad_diff[ii] = neg_grad[i];
    } else {
        //...
    }
}

3.2 Multi Tensor Access

Multi tensor access is a grouping technique that improves the device's memory access efficiency and results in significant performance improvements. In situations where a multitude of small tensors exist, it is possible to improve the process by aggregating the individual tensor accesses into a larger one, see Figure 3.2. This greatly simplifies the operation, leading to fewer kernel launches and global memory accesses, ultimately improving performance.

Figure 3.2: Multi Tensor Access, with a chunk size of 3, meaning that the tensors (red, yellow and green) are split into chunks of 3 for more efficient memory accesses.

3.2.1 Profiling

Table 3.1 provides the SM and memory throughput for different kernel sizes when using the single tensor access fused Adan kernel.

Grid Size   Block Size   SM Throughput   Memory Throughput
21747       1024         39.45%          82.44%
2304        1024         37.34%          79.22%
576         1024         31.44%          62.96%
2           1024         0.78%           0.89%
1           1024         0.45%           0.50%

Table 3.1: Typical kernel sizes with their SM/memory throughput

Figure 3.3 presents the SM and memory throughput curves for running the Adan kernel. The results show that the single tensor access kernel leads to frequent kernel launches with small sizes, causing a significant reduction in both SM and memory throughput. The data in Table 3.1 show a sharp decline in both computational and memory throughput as the grid size decreases. Therefore, it is beneficial to implement multi-tensor access to mitigate the variations caused by the kernel size. This not only enhances the efficiency of the system but also ensures consistent performance.

Figure 3.3: 1-step benchmark displaying the GPU's SM and memory throughput when using single tensor access.

3.2.2 Multi Tensor Apply

In order to use the PyTorch optimizer interface, it is necessary to specify the foreach parameter, which determines the mode in which parameters are accessed for optimization. When foreach is False, the optimization process iterates through each parameter tensor individually, leading to the sequential updating of these individual tensors. However, when foreach is True, the optimizer groups the parameter tensors according to a defined chunk size before sending them to the kernel as a tensor list.

To enable multi tensor access, the tensors need to be packed before launching the kernel computations. To reduce the startup overhead caused by many small tensors, multiple tensors are bundled into fixed-size chunks for computation, as depicted in Figure 3.2.

for (int t = 0; t < ntensors; t++) {
    tl.sizes[loc_tensor_info] = tensor_lists[0][t].numel();
    for (int d = 0; d < depth; d++)
        ...
    for (int chunk = 0; chunk < chunks_this_tensor; chunk++) {
        ...
        if (tensors_full || blocks_full || last_chunk) {
            multi_tensor_apply_kernel<<<...>>>(
                chunk_size, noop_flag.DATA_PTR(), tl, callable, args...);
        }
    }
}

The first condition, tensors_full, requires the chunk sent to the kernel to be optimally sized for the kernel operation. The second condition, blocks_full, ensures that the kernel operates with the maximum allocated number of blocks, which prevents performance bottlenecks caused by too many or too few blocks. The last condition, last_chunk, handles tensors that cannot be processed under the first two conditions, ensuring that all tensors are computed and none are missed.
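To make the grouping idea more concrete, the sketch below shows a simplified Python analogue of multi tensor apply rather than the CUDA implementation used in this thesis. The names chunked_apply and fused_update are hypothetical placeholders: many small parameter tensors are flattened into one buffer, processed in fixed-size chunks by a single fused operation per chunk, and then scattered back.

import torch

def chunked_apply(tensors, chunk_size, fused_update):
    """Pack many small tensors into fixed-size chunks and apply one fused update per chunk."""
    flat = torch.cat([t.reshape(-1) for t in tensors])    # aggregate individual tensor accesses
    for start in range(0, flat.numel(), chunk_size):
        fused_update(flat[start:start + chunk_size])      # one launch covers a whole chunk (in-place)
    # Scatter the updated values back into the original tensors.
    offset = 0
    for t in tensors:
        n = t.numel()
        t.copy_(flat[offset:offset + n].view_as(t))
        offset += n

# Example: a trivial element-wise "update" standing in for the fused Adan kernel.
params = [torch.randn(s) for s in (10, 300, 7)]
chunked_apply(params, chunk_size=256, fused_update=lambda c: c.mul_(0.9))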
3.3 Mixed Precision Training

The current workflow does not utilize mixed precision training, resulting in a missed opportunity for potential performance and energy benefits. The most time-consuming GPU kernels during training are shown in Figure 3.4.

Figure 3.4: Profiling showing the most time-consuming GPU kernels during one training step when using the Hugging Face model without mixed precision.

In the realm of deep learning, matrix multiplication is acknowledged as a computation-intensive kernel. Therefore, each of the single-precision general matrix multiply (sgemm) kernels illustrated in Figure 3.4 can be regarded as an operator of high computational intensity. The sgemm kernels account for 93.6% of the overall time consumption in one training step. It follows that any acceleration of the sgemm kernels can lead to noteworthy enhancements in the overall training performance. Table 3.2 shows the execution time of some typical sgemm kernels with different matrix sizes.

Kernel                               Mean Duration (µs)
volta_sgemm_128x64_nt                1274
volta_sgemm_128x32_nn                1205
volta_sgemm_32x128_tn                743
volta_sgemm_128x32_tn                3113
volta_sgemm_128x32_sliced1x4_nt      3033

Table 3.2: Average execution time of sgemm kernels of different sizes when not using mixed precision.

3.3.1 Enable Mixed Precision Training

By introducing automatic mixed precision (AMP) training into the training flow, the process can be optimized in terms of time and energy consumption for better results. This thesis leverages Hugging Face's Accelerate library to automate the scaling of numerical ranges and the FP32-FP16 conversion. Only minor modifications to the training loop are required to enable automatic mixed precision training.

from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_loader, scheduler = accelerator.prepare(
    model, optimizer, train_loader, scheduler)
...
# Code in trainer
for step, batch in enumerate(self.train_dataloader):
    batch = {k: v.to(self.device) for k, v in batch.items()}
    self.model.zero_grad()
    outputs = self.model(**batch)
    loss = outputs.loss
    self.accelerator.backward(loss)  # original: loss.backward()
    # ...

The code above demonstrates the initialization required before using the accelerator. In this code, the loss is passed to the backward function provided by the accelerator. To ensure proper functioning, it is crucial to initialize the model beforehand. During initialization, the accelerator object establishes internal links to the original model parameters. Once the loss is passed in, the FP16 model optimized by the accelerator can be used after range scaling.

Mixed precision training offers two main advantages, briefly described below.

1. Reduced memory usage: FP16/BF16 is half the length of FP32, meaning that more data can be transferred during training, which may enable more data to be used per batch. However, depending on the implementation, mixed precision might yield higher memory requirements. For example, Hugging Face's implementation of mixed precision stores FP32 copies of the model parameters and stores the gradients in FP32, whilst also storing the FP16 values of the model parameters, meaning that more memory is actually required for storing model parameters and gradients. In particular, Hugging Face's implementation requires 4 bytes per model parameter and 4 bytes per gradient during FP32 training,
whilst requiring 6 bytes per model parameter and 4 bytes per gradient during mixed precision training. This is unlike the LightSeq implementation, which avoids storing FP32 copies and instead uses on-the-fly FP16 <-> FP32 conversion, meaning that LightSeq reduces the memory requirements to only 2 bytes per model parameter and 2 bytes per gradient. This yields both a speedup, since every memory access is FP16, and lower memory requirements.

2. More computationally efficient: Specialized AI acceleration chips, such as the NVIDIA V100 GPU and Google's TPU, can execute operations faster using FP16 than FP32. In NVIDIA's GPUs, FP32 data is converted to FP16 data in the SMs and then fed into the tensor cores for accelerated operations through a specific FP32-FP16 data workflow.

3.4 LightSeq Training

Apart from the computationally intensive operations in the Transformer encoder, other operations such as transposes, bias additions, softmax, and data transfer between the device and the host may also impact performance. To enhance the fine-tuning process, it is of interest to minimize the potential overhead caused by these operations.

The current implementation of the BERT model by Hugging Face prioritizes adhering to the original structure of the model rather than focusing on performance optimization. As a result, Hugging Face's implementation provides a stable and convenient API for data scientists to use; however, there are many optimizations that can be made to increase performance. Table 3.3 shows the average execution time of the Softmax and LayerNorm kernels in the original Hugging Face implementation. Furthermore, the time consumption of the Hugging Face encoder in both the forward and backward passes without mixed precision is also found in the table.

Kernel                            Average Duration
softmax_warp_forward              497 µs
softmax_warp_backward             712 µs
vectorized_layer_norm_kernel      80 µs
encoder forward                   14.8 ms
encoder backward                  25 ms

Table 3.3: Average execution time of common kernels within the Hugging Face model.

LightSeq offers an alternative implementation of the Hugging Face Transformer, which surpasses Hugging Face's BERT model implementation in terms of performance. This is accomplished by sacrific